ABSTRACT

In this paper, we propose causality as a unified framework 
to explain query answers and non-answers, thus generaliz- 
ing and extending several previously proposed approaches of 
provenance and missing query result explanations. 

We develop our framework starting from the well-studied 
definition of actual causes by Halpern and Pearl [13]. After 
identifying some undesirable characteristics of the original 
definition, we propose functional causes as a refined defi- 
nition of causality with several desirable properties. These 
properties allow us to apply our notion of causality in a 
database context and apply it uniformly to define the causes 
of query results and their individual contributions in several 
ways: (i) we can model both provenance as well as non- 
answers, (ii) we can define explanations as either data in 
the input relations or relational operations in a query plan, 
and (iii) we can give graded degrees of responsibility to indi- 
vidual causes, thus allowing us to rank causes. In particular, 
our approach allows us to explain contributions to relational 
aggregate functions and to rank causes according to their 
respective responsibilities. We give complexity results and 
describe polynomial algorithms for evaluating causality in 
tractable cases. Throughout the paper, we illustrate the 
applicability of our framework with several examples. 

Overall, we develop in this paper the theoretical founda- 
tions of causality theory in a database context. 

1. INTRODUCTION 

When analyzing data sets and domains of interest, users 
are often interested in explanations for their observations. 
In a database context, such explanations concern results to 
explicit or implicit queries. For example, "Why does my 
personalized newscast have more than 20 items today?" Or, 
"Why does my favorite undergrad student not appear on 
the Dean's list this year?" Database research that addresses 
these or similar questions is mainly work on lineage of query 
results, such as why [8] or where provenance [3], and very 
recently, explanations for non-answers. While these



approaches differ over what the response to questions should 
be, all of them seem to be linked through a common underly- 
ing theme: understanding causal relationships in databases. 

Humans usually have an intuition about what constitutes 
a cause of a given effect. In this paper, we define the fun- 
damental notion of functional causality that can model this 
intuition in an exact mathematical framework, and show 
how it can be applied to encode and solve various causality 
related problems. In particular, it allows us to uniformly 
model the questions of Why so? and Why no? with re- 
gards to query answers. It also effectively allows us to repre- 
sent different approaches taken so far, thus illustrating that 
causality is a critical element unifying important work in 
this field. 

We start with a simple illustrative example. 

Example 1.1 (News feed). A user has a personal- 
ized news feed that filters incoming news based on matching 
predefined tags. Let relation K(tag) represent the table with 
the user-defined tags, N(nid, story, tag) the incoming news, 
and P(nid, story) the personalized news feed. For simplicity, 
we assume one single tag per news item and ignore times- 
tamps. P can then be represented by the query 

    create view P as
      select N.nid, N.story
      from N
      where exists (select *
                    from K
                    where K.tag = N.tag)

As a result, the view P will be a collection of news matching
the user's preferences as shown in Fig. 1. The user may
now ask questions about this view. For example, "Why am
I getting so many stories about Indianapolis?" (5 in total).
The system should answer that the user's keywords DB_conf,
Purdue, and Movies are causes with some kind of decreasing
responsibility. On the other hand, the user may have heard
that there should be far more news feeds on Indianapolis this
week and wonders "Why am I NOT getting MORE stories
about Indianapolis?" The system should suggest the lack of
the keyword Indy_500 in the user-defined relation K as a
possible cause (inserting it would increase the count from 5 to
8 articles on Indianapolis).
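The counts quoted in the example can be reproduced with a small sketch. It loads the rows of Fig. 1 into an in-memory SQLite database, creates the view P with the query above, and counts the stories that mention Indianapolis. The helper name `indianapolis_count`, the abbreviated story strings, and the reading of row 7's tag as DB_conf are assumptions made for this illustration; the sketch only counts stories and does not compute causes or responsibilities.

```python
import sqlite3

# Rows of N as listed in Fig. 1 (story text abbreviated; row 7's tag is
# assumed to be DB_conf, which is consistent with nid 7 appearing in P).
NEWS = [
    (1, "the race of Indianapolis this year may", "Indy_500"),
    (2, "economic downturn affected sensitive", "Business"),
    (3, "with sequences shot in Indianapolis", "Movies"),
    (4, "when President Obama meets former ally", "Obama"),
    (5, "Indianapolis officials debating the budget", "Purdue"),
    (6, "most amazing event in the US with few", "Burning_man"),
    (7, "PODS held in Indianapolis this year", "DB_conf"),
    (8, "discussed in a recent talk the options to", "Politics"),
    (9, "VLDB conference this year in Singapore", "DB_conf"),
    (10, "at the Indianapolis Motor Speedway", "Indy_500"),
    (11, "Indianapolis host to SIGMOD/PODS", "DB_conf"),
    (12, "SIGMOD in Indianapolis promises to be", "DB_conf"),
    (13, "more people in Indianapolis this year", "Indy_500"),
    (14, "recent ranking held positive surprises for", "Purdue"),
]
KEYWORDS = ["Obama", "DB_conf", "Purdue", "Burning_man", "Movies", "Afghanistan"]


def indianapolis_count(keywords):
    """Count the stories in view P that mention Indianapolis for a given K."""
    con = sqlite3.connect(":memory:")
    con.execute("create table N (nid integer, story text, tag text)")
    con.execute("create table K (tag text)")
    con.executemany("insert into N values (?, ?, ?)", NEWS)
    con.executemany("insert into K values (?)", [(k,) for k in keywords])
    con.execute(
        "create view P as select N.nid, N.story from N "
        "where exists (select * from K where K.tag = N.tag)"
    )
    (count,) = con.execute(
        "select count(*) from P where story like '%Indianapolis%'"
    ).fetchone()
    con.close()
    return count


why_so = indianapolis_count(KEYWORDS)                 # current view
why_no = indianapolis_count(KEYWORDS + ["Indy_500"])  # after inserting Indy_500
# Count after deleting one keyword at a time from K.
without = {
    tag: indianapolis_count([k for k in KEYWORDS if k != tag])
    for tag in ("DB_conf", "Purdue", "Movies")
}
print(why_so, why_no, without)
```

Deleting DB_conf lowers the count more than deleting Purdue or Movies, which is the informal reason the example lists DB_conf first.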
N (news feeds)

    nid  story                                                  tag
    1    ... the race of Indianapolis this year may ...         Indy_500
    2    ... economic downturn affected sensitive ...           Business
    3    ... with sequences shot in Indianapolis ...            Movies
    4    ... when President Obama meets former ally ...         Obama
    5    ... Indianapolis officials debating the budget ...     Purdue
    6    ... most amazing event in the US with few ...          Burning_man
    7    ... PODS held in Indianapolis this year ...            DB_conf
    8    ... discussed in a recent talk the options to ...      Politics
    9    ... VLDB conference this year in Singapore ...         DB_conf
    10   ... at the Indianapolis Motor Speedway ...             Indy_500
    11   ... Indianapolis host to SIGMOD/PODS ...               DB_conf
    12   ... SIGMOD in Indianapolis promises to be ...          DB_conf
    13   ... more people in Indianapolis this year ...          Indy_500
    14   ... recent ranking held positive surprises for ...     Purdue

K (keywords)

    tag
    Obama
    DB_conf
    Purdue
    Burning_man
    Movies
    Afghanistan

P (personalized news)

    nid  story
    3    ... with sequences shot in Indianapolis ...
    4    ... when President Obama meets former ally ...
    5    ... Indianapolis officials debating the budget ...
    6    ... most amazing event in the US with few ...
    7    ... PODS held in Indianapolis this year ...
    9    ... VLDB conference this year in Singapore ...
    11   ... Indianapolis host to SIGMOD/PODS ...
    12   ... SIGMOD in Indianapolis promises to be ...
    14   ... recent ranking held positive surprises for ...

Figure 1: Example of a personalized news feed (P)
as a result of a query filtering all news (N) based on
user-defined keywords (K).

As illustrated in Example 1.1, we want to allow users to
ask simple questions based on the results they receive, and
hence, allow them to learn what may be the cause of any
surprising or undesirable answer. Such questions can refer
to either presence (Why so?) or absence (Why no?) of
results. Furthermore, the user should be provided with a
ranking of causes based on their individual contribution or
responsibility. Our ultimate goal is to define a language that
allows users to specify causal queries for given results. In
this paper, we lay the theoretical groundwork and define a
formal model that allows us to capture such causality-related
questions in a uniform framework.

Summary and outline. Section 2 analyzes causality in
Boolean networks in general: We start by reviewing existing
definitions of counterfactual and actual causes (Sect. 2.1
to 2.2). We also illustrate problems of these previous
definitions, and propose functional causes as a refined notion
of causality that mitigates these problems (Sect. 2.3). In
Sect. 3, we then describe and prove several desirable
properties of functional causes. We also give complexity results
for general and restricted Boolean networks. Section 4
applies our general framework to give Why so? and Why no?
explanations to database queries. We show that our unifying
framework generalizes provenance as well as non-answers
(Sect. 4.1), handles contributions to aggregate functions by
ranking causes according to their responsibilities for the
result (Sect. 4.2), and can also model causes other than tuples
(Sect. 4.3). We discuss related work in Sect. 5, point out
some directions for future work (Sect. 6), and give detailed
proofs and elaborated examples in the appendix.

2. CAUSALITY 

This section discusses the two most established notions of
causality, then our new definition. The first is the notion
of counterfactual causes, which is intuitive and simple, but
very limited in its applicability. The second is the definition
of actual causes by Halpern and Pearl (HP from now on),
which can better reproduce common-sense causal answers
and has become central in the causality literature. We then
give our definition of functional causes, which is a refinement
of the HP definition that can model more cases correctly and
has additional desirable properties for database applications.

General notions. We assume a set of Boolean random
variables which model a causal problem. A capital letter
(e.g. $X$) denotes a variable, and a lower case letter with
exponent 0 or 1 (e.g. $x^0$) denotes a truth value. An event
is a truth value assignment to one or more variables (e.g.
$X = x^0$). We use the vector sign (e.g. $\vec{X}$) to denote an
ordered or unordered set, depending on the context. A causal
model $M$ is a tuple $(\vec{N}, \mathcal{F})$, with $\vec{N}$ representing a set of
variables, and $\mathcal{F} = \{F_N \mid N \in \vec{N}\}$ a set of structural equations
$F_N : \{0,1\}^{|\vec{P}_N|} \to \{0,1\}$ that assign a truth value to $N$ for
each value of its parents $\vec{P}_N \subseteq \vec{N} \setminus \{N\}$. The causal network
(CN) is the directed acyclic graph representing the
dependencies between the Boolean variables (like in a Bayesian
network). We call nodes without parents input variables and
the rest dependent variables, denoting them with $\vec{X}$ and $\vec{Y}$,
respectively. We associate to each dependent variable $Y$ a
Boolean formula $\Phi_Y$ that determines its truth value $Y(\vec{X})$ based
on the values of the input variables. The Boolean formula of
a distinguished effect variable is denoted as $\Phi(\vec{X})$. The effect
$\phi$ represents the event that the effect variable has its current
assignment: $\phi = (\Phi(\vec{X}) = \Phi(\vec{x}^0))$. Causality is always
determined for a given actual assignment $\vec{x}^0$. The causal path is
the set of all descendants of a variable under consideration.
An external intervention $[\vec{S} \leftarrow \vec{s}^1]$ for $\vec{S} \subseteq \vec{N}$ considers a
modified causal model where each node $N \in \vec{S}$ is assigned a
truth value $n^1$ that replaces its structural equation $F_N$.

2.1 Counterfactual Causes 

With deep roots in philosophy [18], the argument of cou- 
nterfactual causality is that the relationship between cause 
and effect can be understood as a counterfactual statement, 
i.e. an event is considered a cause of an effect if the effect 
would not have happened in the absence of the event. 

Definition 2.1 (Counterfactual Cause [22]). The
event $X = x^0$ is a cause of $\phi$ in a causal model $M$ iff:

CC1. $X = x^0$

CC2. $[X \leftarrow \neg x^0] \Rightarrow \neg\phi$

Example 2.2 (one thrower). Alice throws a rock at 
a bottle and the bottle breaks. If Alice had not thrown the 
rock, then the bottle would not have broken. Therefore, Alice 
throwing the rock is a cause of the bottle breaking. 

Shortcomings of Counterfactual Causes. Counter- 
factual causality cannot handle slightly more complicated 
scenarios such as disjunctive causes, i.e. when there are two 
potential causes of an event. 

Example 2.3 (two throwers). Alice and Bob
each throw a rock at the bottle and it breaks. Had Alice not 
thrown the rock, the bottle would still have broken. According 
to the counterfactual definition, Alice's throw is not a cause 
even though common sense suggests she should be. 

Figure 2a shows an example of a simple causal network
for Example 2.3. The events of Alice and Bob throwing
rocks are modeled with truth value 1 for variables $A$ and
$B$ respectively, while $Y$ models the effect variable (i.e. the
bottle breaking, $\phi$), which is true if either $A$ or $B$ is true.
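The failure of the counterfactual definition on disjunctive causes can be checked mechanically. The sketch below tests condition CC2 by flipping a single variable in the actual assignment; the function names are hypothetical and the formulas are those of Examples 2.2 and 2.3.

```python
def is_counterfactual_cause(formula, assignment, var):
    """CC2 check: flipping `var` alone must change the value of `formula`."""
    flipped = dict(assignment, **{var: 1 - assignment[var]})
    return formula(**assignment) != formula(**flipped)


def one_thrower(A):
    # Example 2.2: the bottle breaks iff Alice throws.
    return A


def two_throwers(A, B):
    # Example 2.3 (Fig. 2a): the bottle breaks if Alice or Bob throws.
    return A | B


alice_alone = is_counterfactual_cause(one_thrower, {"A": 1}, "A")
alice_with_bob = is_counterfactual_cause(two_throwers, {"A": 1, "B": 1}, "A")
print(alice_alone, alice_with_bob)
```

With both throwers present, flipping A alone leaves Y at 1, so Alice's throw is rejected as a cause, which is the shortcoming described above.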
[Figure 2: two causal networks with inputs A = 1 and B = 1; panel (a) has effect node Y = A ∨ B, panel (b) adds the intermediate node Y_1.]

Figure 2: Two throwers. In both models, the bottle
breaks (Y = 1) if either Alice throws (A = 1) or Bob
throws (B = 1). Model b encodes the preemption of
Bob's throw by Alice with an additional intermediate
variable Y_1 for Bob hitting the unbroken bottle.

2.2 Actual Causes 

The HP definition of causality [13] is based on
counterfactuals, but can correctly model disjunction and many other
complications. The idea is that $X$ is a cause of $Y$ if $Y$
counterfactually depends on $X$ under "some" permissive
contingency, where "some" is elaborately defined. This definition
is significant in causality theory. We present it here in an
abbreviated way and refer to [13] for details. Note that the
HP definition allows subsets of both input variables $\vec{X}$ and
dependent variables $\vec{Y}$ to be a cause of $\phi$. In the following,
$\vec{N}_c$ refers to a subset from all nodes of a network.

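Definition 2.4 (Actual Cause [13], Def 3.1). The
event $\vec{N}_c(\vec{x}^0) = \vec{n}^0$ is a cause of $\phi$ in a causal model $M$ iff:

AC1. Both $\vec{N}_c(\vec{x}^0) = \vec{n}^0$ and $\phi$ hold under assignment $\vec{x}^0$.

AC2. There exists a partition $(\vec{Z}, \vec{W})$ of $\vec{N}$ with $\vec{N}_c \subseteq \vec{Z}$
and an assignment $(\vec{n}_c^1, \vec{w}^1)$ of the variables $(\vec{N}_c, \vec{W})$,
such that the following two conditions hold:

(a) $[\vec{N}_c \leftarrow \vec{n}_c^1, \vec{W} \leftarrow \vec{w}^1] \Rightarrow \neg\phi$

(b) $[\vec{N}_c \leftarrow \vec{n}^0, \vec{W}' \leftarrow \vec{w}^1, \vec{Z}' \leftarrow \vec{z}^0] \Rightarrow \phi$, for all
subsets $\vec{W}' \subseteq \vec{W}$ and $\vec{Z}' \subseteq \vec{Z}$

AC3. $\vec{N}_c$ is minimal, i.e. no subset $\vec{N}'_c \subset \vec{N}_c$ is a cause.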

The heart of the definition is condition AC2, which is
effectively a generalization of counterfactual causes. The
requirement is that there exists some assignment of the variables for
which $\vec{N}_c$ is counterfactual, and that this assignment does
not make any fundamental changes to the causal path of $\vec{N}_c$.

The HP definition correctly handles disjunctive causes as
in Example 2.3, recognizing both Alice's and Bob's throws
as causes. Its use of the causal network makes it very flexible
in capturing different scenarios of causal relationships. For
example, it is easy to model preemption, i.e. when there are
two potential causes of an event and one preempts the other.

Example 2.5 (two throwers continued). Assume
that Alice's rock hits the bottle first. Then Alice's throw
would be considered a cause of the bottle breaking, but not
Bob's. This precedence of Alice's throw is not encoded in the
network of Fig. 2a (model a). It can be modeled by adding
the variable $Y_1$ in Fig. 2b (model b): The bottle breaks if
either Alice throws, or if Alice doesn't throw and Bob throws
($Y_1 = 1$). The Boolean formulas for the effect, $\Phi(\vec{X}) = A \lor B$
or $\Phi(\vec{X}) = A \lor (\neg A \land B)$ for models a and b respectively, are
equivalent, but the causal relevance of variable $B$ is not: Bob's
throw is an actual cause in model a, but not in model b.

This result is intuitive, because Alice's rock hits the bottle
first, breaking it and preempting Bob's rock from hitting and
breaking it. While there exists an assignment of variables ($A \leftarrow 0$)
that makes Bob's throw ($B = 1$) counterfactual, this
assignment changes the value of node $Y_1$ from 0 to 1, establishing
a change in the causal path of $B$. Since there is no path
from $B$ to $Y$ that doesn't go through $Y_1$, $B$ is not a cause.

Shortcomings of Actual Causes. The HP definition
of actual cause is well established in the causality literature,
but it does not correctly handle some cases, leading to
non-intuitive results. The following is a well-studied example (see
[22]), originally given by McDermott [21], for which the HP
definition does not match common sense, i.e. the commonly
accepted interpretation in philosophical circles.

[Figure 3: causal network with input A = 1 and dependent nodes B = A and C = (A = B).]

Figure 3: Shock C. Simple example where actual
causality fails to match common sense: A is determined
a cause of C according to the HP definition,
although C is always true.



Example 2.6 (Shock C [21]). Shock C is a game for
three players. A and B each have a switch which they can
move to the left or right. If both switches are thrown into
the same position, a third person C receives a shock. A does
not want to shock C. Seeing B's switch in the left position,
A moves his switch to the right. B wants to shock C. Seeing
A's switch thrown to the right, she now moves her switch to
the right as well. C receives a shock. Clearly, A's move was
a cause of B's move, and B's move was a cause of C's shock,
but A's move was not a cause of C's shock.

This example can be modeled with the causal network from
Fig. 3 and structural equations

    $B = A$

    $C = (A \equiv B) = (A \land B) \lor (\neg A \land \neg B)$

under actual assignment $A = 1$, and hence $B = 1$, $C = 1$. The
effect $\phi$ under consideration is $C = 1$. Here, and contrary
to common sense, $A = 1$ is an actual cause of $C = 1$: Take
$\vec{W} = \{B\}$ with $b^1 = 1$. Then AC2(a) holds: $[A \leftarrow 0, B \leftarrow 1] \Rightarrow \neg\phi$.
Also AC2(b) holds: $\vec{Z} \setminus \vec{N}_c \setminus \{C\}$ is empty, and for
either $\vec{W}' = \vec{W}$ or $\vec{W}' = \emptyset$, $C$ is 1 because of $A \leftarrow 1$. Hence,
the HP definition does not deliver the common-sense answer
for this example, making A the cause of a tautology.
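The two observations in this argument can be replayed on the structural equations. The sketch below is only an illustration of this one contingency, not a general implementation of the HP definition; the function `shock_c` and its optional intervention argument `b` are hypothetical names.

```python
def shock_c(a, b=None):
    """Value of C for input A = a.

    With b=None, B follows its structural equation B = A.
    Passing b models an external intervention [B <- b].
    """
    if b is None:
        b = a
    return int(a == b)


# Without interventions, C = 1 for every value of the only input A.
always_shocked = all(shock_c(a) == 1 for a in (0, 1))

# Contingency W = {B} with b1 = 1:
ac2a = shock_c(0, b=1) == 0                      # [A <- 0, B <- 1] falsifies C = 1
ac2b = shock_c(1, b=1) == 1 and shock_c(1) == 1  # W' = W and W' = empty set

print(always_shocked, ac2a, ac2b)
```

Both AC2 conditions hold even though C is 1 for every input, which is why the HP definition ends up naming A a cause of a tautology.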



Appendix B.1 gives more details on the Shock C example,
and shows that the HP definition cannot handle this example
even with a more elaborate causal network, while the
following definition of functional causality can.

2.3 Functional Causes 

A fundamental challenge in applying causality to queries
is that causality is defined over an entire network: it is not
enough to know the dependency of the effect on the input
variables, we also need to reason about intermediate
dependent nodes. This requirement is difficult to carry over to
a database setting, where we care about the semantics of a
query rather than a particular query plan. Our approach
is to represent a causal network with two appropriate
functions that semantically capture the causal dependencies of a
network. The two key notions we need for that are potential
functions and dissociation expressions.

Figure 4: FC framework: the causal network is partitioned
into the input variables $\vec{X}$ with cause under
consideration $X_i$, and dependent variables $\vec{Y}$ with
effect variable $Y_j$. Support $\vec{S} \subseteq \vec{X} \setminus \{X_i\}$ corresponds
to permissive contingency from the HP framework.

Figure 4 represents a causal network in our framework.
In contrast to the HP approach, only input variables from
$\vec{X}$ can be causes and part of permissive contingencies. As
in the HP approach, every dependent node $Y$ is described
by a structural equation $F_Y$, which assigns a truth value
to $Y$ based on the values of its parents. The Boolean
formula $\Phi_Y$ of $Y$ defines its truth assignment based on the
input variables $\vec{X}$, and is constructed by recursing through
the structural equations of $Y$'s ancestors. For example, in
Fig. 2b, $\Phi_Y = A \lor (\neg A \land B)$, where $\vec{X} = \{A, B\}$. We
denote as $\Phi(\vec{X}) = \Phi_{Y_j}(\vec{X})$, where $Y_j$ is the effect node, and
we say that the causal network has formula $\Phi$. The potential
function $P_\Phi$ is then simply the unique multilinear polynomial
representing $\Phi$. It is equal to the probability that $\Phi$ is
true given the probabilities of its input variables.

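Definition 2.7 (Potential Function). The potential
function $P_\Phi(\vec{x})$ of a Boolean formula $\Phi(\vec{X})$ with
probabilities $\vec{x} = \{x_1, \ldots, x_k\}$ of the input variables is defined
as follows:

$$P_\Phi(\vec{x}) = \sum_{\vec{\varepsilon} \in \{0,1\}^k} \left( \prod_{i=1}^{k} x_i^{\varepsilon_i} (1 - x_i)^{1 - \varepsilon_i} \right) \Phi(\vec{\varepsilon})$$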

The potential function is a sum with one term for each
truth assignment $\vec{\varepsilon}$ of variables $\vec{X}$. Each term is a product
of factors of the form $x_i$ or $1 - x_i$ and only occurs in the sum
if the formula is true at the given assignment ($\Phi(\vec{\varepsilon}) = 1$).
For example, if $\Phi = X_1 \land (X_2 \lor X_3)$ then
$P_\Phi = x_1 x_2 (1 - x_3) + x_1 (1 - x_2) x_3 + x_1 x_2 x_3$,
which simplifies to $x_1 (x_2 + x_3 - x_2 x_3)$.
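The definition can be evaluated directly by enumerating truth assignments. The following sketch does so for the example formula and compares the result with the simplified polynomial; the function names and the chosen probabilities are assumptions for illustration, and exact rational arithmetic is used so that the comparison is not affected by rounding.

```python
from fractions import Fraction
from itertools import product


def potential(formula, probs):
    """Potential function: sum over satisfying assignments e of
    prod_i x_i^(e_i) * (1 - x_i)^(1 - e_i)."""
    total = Fraction(0)
    for eps in product((0, 1), repeat=len(probs)):
        if formula(*eps):
            term = Fraction(1)
            for x, e in zip(probs, eps):
                term *= x if e else 1 - x
            total += term
    return total


def phi(x1, x2, x3):
    # Example formula X1 and (X2 or X3).
    return x1 and (x2 or x3)


def closed_form(x1, x2, x3):
    # Simplified polynomial given in the text.
    return x1 * (x2 + x3 - x2 * x3)


probs = (Fraction(1, 2), Fraction(1, 4), Fraction(3, 4))
enumerated = potential(phi, probs)
simplified = closed_form(*probs)
print(enumerated, simplified)
```

On 0/1 inputs the potential function coincides with the truth value of the formula, which is the deterministic setting used in the rest of the section.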

We ground our framework on potential functions because
they allow us to extend functional causes to probabilistic
databases, a topic that we briefly discuss in Sect. 6. For the
deterministic settings of this paper, we use delta notation to
denote changes $\Delta P$ in the potential function due to changes
in the inputs: Given an actual assignment $\vec{x}^0$ and a subset of
variables $\vec{S}$, we define $\Delta P_\Phi(\vec{S}) := P_\Phi(\vec{x}^0) - P_\Phi(\vec{x}^0 \oplus \vec{S})$, where
$\vec{x}^0 \oplus \vec{S}$ ($\oplus$ denoting XOR) indicates the assignment obtained by
starting from $\vec{x}^0$ and inverting all variables in $\vec{S}$.

We use dissociation expressions (DE) to semantically
capture differences in causality between networks with logically
equivalent Boolean formulas (e.g. Fig. 2):

Definition 2.8 (Dissociation Expression). A
dissociation expression with respect to a variable $X_0$ is a Boolean
expression defined by the grammar:

    $\Psi ::= X \in \vec{X}$

    $\Psi ::= \sigma(\Psi_1, \Psi_2, \ldots, \Psi_k)$, with
    $X_0 \in \bigcup_i V(\Psi_i) \Rightarrow V(\Psi_i) \cap V(\Psi_j) \subseteq \{X_0\}$ for all $i \neq j$

where $V(\Psi_i)$ is the set of input variables of formula $\Psi_i$.

Dissociation expressions allow us to semantically capture
with a Boolean formula the effect of a variable along
different network paths, by disallowing a variable from being
combined with $X_0$ in more than one subexpression. For
example, in the network of Fig. 5a, variable $A$ contributes to
the causal path of $B$ at two locations. This "independent"
influence can be represented by the dissociation expression
$\Psi = A_1 \lor (\neg A_2 \land B)$, which essentially separates $A$ into two
variables. $\Psi' = A \lor (\neg A \land B)$ is not a valid DE with respect to
$B$, because for its subexpressions, $\Psi_1 = A$ and $\Psi_2 = \neg A \land B$,
it is $B \in V(\Psi_1) \cup V(\Psi_2)$ but $V(\Psi_1) \cap V(\Psi_2) = \{A\} \not\subseteq \{B\}$.
Note however that $\Psi'$ is a DE w.r.t. $A$, as no variable is
combined with $A$ in more than one subexpression.

We demonstrate how $\Psi$ captures semantically the
network structure: to check actual causality of $B$ in the
network of Fig. 5a, we need to determine the value of $Y$ for
a setting $\{A = 0, B = 1\}$ while forcing $Y_1$ to its original
value, as part of condition AC2(b). The dissociation
expression $\Psi(A_1, A_2, B) = A_1 \lor (\neg A_2 \land B)$, with potential function
$P_\Psi(a_1, a_2, b) = a_1 + b - a_1 b - a_2 b + a_1 a_2 b$, allows us to perform
the same check by simply computing $P_\Psi(0, 1, 1)$. In this case
$P_\Psi(0, 1, 1) = 0 \neq P_\Psi(1, 1, 1)$, which was the original variable
assignment, meaning that the change in assignment altered
values on the causal path.
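The polynomial quoted above can be checked against the dissociation expression by enumeration. In this sketch the negation on $A_2$ is read from the preemption model of Example 2.5 (Bob's rock only matters if Alice did not throw), and the function names are hypothetical.

```python
from itertools import product


def psi(a1, a2, b):
    """Dissociation expression A1 or (not A2 and B), on 0/1 inputs."""
    return int(a1 or ((not a2) and b))


def p_psi(a1, a2, b):
    """Potential function of psi as given in the text."""
    return a1 + b - a1 * b - a2 * b + a1 * a2 * b


# The polynomial agrees with the Boolean expression on all eight assignments.
agree = all(psi(*e) == p_psi(*e) for e in product((0, 1), repeat=3))

original = p_psi(1, 1, 1)     # actual assignment A = 1, B = 1
contingency = p_psi(0, 1, 1)  # A1 flipped to 0 while A2 keeps Y_1 at its original value
print(agree, original, contingency)
```

Only the replica $A_1$ is flipped, so the intermediate node that depends on $A_2$ stays at its original value, mirroring the "forcing $Y_1$" step of AC2(b).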

To link dissociation expressions to a Boolean formula of a
causal network, we define expression folding.

Definition 2.9 (Expression Folding). Given function
$f : \vec{X}' \to \vec{X}$ mapping variables $\vec{X}'$ to $\vec{X}$, the folding
$(\mathcal{F}, f)$ of a dissociation expression $\Psi(\vec{X}')$ defines a formula
$\Phi = \mathcal{F}(\Psi)$, s.t.:

    $\mathcal{F}(X') = f(X')$

    $\mathcal{F}(\sigma(\Psi_1, \Psi_2, \ldots, \Psi_k)) = \sigma(\mathcal{F}(\Psi_1), \mathcal{F}(\Psi_2), \ldots, \mathcal{F}(\Psi_k))$

For example, $f(\{A_1, A_2, B\}) = \{A, A, B\}$ defines a folding
from $\Psi = A_1 \lor (\neg A_2 \land B)$ to the formula $\Phi = A \lor (\neg A \land B)$. In
simple terms, a DE $\Psi$ with a folding to $\Phi$ is a representation
of $\Phi$ in a larger space of input variables. The use of more
inputs captures the distinct effect of variables on the causal
path, thus providing the necessary network semantics. We
use $|\Psi|$ to denote the cardinality of the input set of $\Psi$. Then
$|\Psi| \geq |\Phi|$, and if $|\Psi| = |\Phi|$ then $\Psi = \Phi$.

Theorem 2.10 (DE Minimality). If $\mathcal{D}$ is the set of all
DEs w.r.t. $X_0 \in \vec{X}$ with a folding to $\Phi(\vec{X})$, then there exists
a unique $\Psi_i \in \mathcal{D}$ of minimum size: $|\Psi_i| = \min_j |\Psi_j|$ and
$\forall j \neq i$, $|\Psi_j| = |\Psi_i| \Rightarrow \Psi_j = \Psi_i$.

The DE of minimum size replicates those variables, and only
those variables, that affect the causal path at more than one
location. It is simply called the dissociation expression of $\Phi$,
and can be represented as a network (dissociation network of
$\Phi$), with input nodes $\vec{X}_t$ (Fig. 5b). A folding maps $\vec{X}_t$ back
to the original input variables: $\vec{X} = f(\vec{X}_t)$. The reverse
mapping is denoted $\vec{X}_t = [\vec{X}]_t = \{X_i \mid f(X_i) \in \vec{X}\}$.
[Figure 5: (a) network with inputs A = 1, B = 1, intermediate node Y_1 = ¬A B and effect Y = A ∨ ¬A B; (b) network with A replicated into A_1 and A_2 and effect Y = A_1 ∨ ¬A_2 B.]

Figure 5: A causal network CN (a) and its dissociation
network DN (b) with respect to B.

Definition 2.11 (Functional Cause). The event
$X_i = x_i^0$ is a cause of $\phi$ in a causal model iff:

FC1. Both $X_i = x_i^0$ and $\phi$ hold under assignment $\vec{x}^0$

FC2. Let $P_\Phi$ and $P_\Psi$ be the potential functions of $\Phi$ and
its DE w.r.t. $X_i$, respectively. There exists a support
$\vec{S} \subseteq \vec{X} \setminus \{X_i\}$, such that:

(a) $\Delta P_\Phi(\vec{S}, X_i) \neq 0$

(b) $\Delta P_\Psi(\vec{S}'_t) = 0$, for all subsets $\vec{S}'_t \subseteq [\vec{S}]_t$

Here, $\Delta P_\Phi(\vec{S}, X_i)$ denotes $\Delta P_\Phi(\vec{S} \cup X_i)$. Condition FC2(b)
is analogous to AC2(b) of the HP definition, which requires
checking that the effect does not change for all possible
combinations of setting the dependent nodes to their original
values. Similarly, FC2(b) ensures that no part of the changed
nodes (the support $\vec{S}$) is counterfactual in the dissociation
network. Note that the functional causality definition does
not have a minimality condition (equivalent to AC3), as it
is directly applied to single literals. As implied by [9] and
[12], only primitive events can be causes when dealing with
input variables, and therefore a minimality condition is not
necessary.

Intuition. The definition of functional causes captures 
three main points: (i) a counterfactual cause is always a 
cause, (ii) if a variable is not counterfactual under any pos- 
sible assignment of the other variables, then it cannot be a 
cause, and (iii) if $X = x^0$ is a counterfactual cause under
some assignment that inverts a subset S of the other vari- 
ables, then no part of S should be by itself counterfactual. 

We revisit the rock thrower example to demonstrate how
FC (like AC) can handle preemption. In Sect. 3.2 and
Sect. B.1, we show how functional causality successfully
handles cases where the HP definition does not give the
intuitively correct result, as in Example 2.6 (Shock C).

Example 2.12 (two throwers revisited). The minimal
dissociation expression for $\Phi = A \lor (\neg A \land B)$ with respect
to $B$ is $\Psi = A_1 \lor (\neg A_2 \land B)$, and is depicted in Fig. 5. Then:

    $P_\Phi = a + b - ab$

    $P_\Psi = a_1 + b - a_1 b - a_2 b + a_1 a_2 b$

For $\vec{S} = \{A\}$, $\Delta P_\Phi(B, \vec{S}) \neq 0$. If $(\mathcal{F}, f)$ is the folding of $\Psi$
into $\Phi$, then $[\vec{S}]_t = \{A_1, A_2\}$, and $\Delta P_\Psi(A_1) \neq 0$, so $B$ is
not a cause.

Hence, the definition of functional causes effectively
captures the difference between the two networks for the two
thrower example (Fig. 2) while only focusing on the input
nodes. In the case of the simple network, $P_\Phi = P_\Psi$ and
for $\vec{S} = \{A\}$, $B$ can be shown to be a cause. However,
in the more complicated network, the potential function of
the dissociation expression gives priority to A's throw and
determines that $B$ is not a cause of the bottle breaking.



If the causal network is a tree, then the causal formula
is itself a dissociation expression with potential $P_\Phi$. Then,
(FC2) simplifies to: (a) $\Delta P_\Phi(\vec{S}, X_i) \neq 0$ and (b) $\forall \vec{S}' \subseteq \vec{S}$:
$\Delta P_\Phi(\vec{S}') = 0$. Causal networks which are trees form an
important category of causality problems as they model many
practical cases of database queries, and they are characterized
by desirable properties, as we show in Sect. 3.4.

Responsibility. Responsibility is a measure for degree
of causality, first introduced by Chockler and Halpern [6].
We redefine it here for functional causes.

Definition 2.13 (Responsibility). Responsibility $\rho$ of
a causal variable $X_i$ is defined as $\rho = \frac{1}{|\vec{S}| + 1}$, where $\vec{S}$ is the
minimum support for which $X_i$ is a functional cause of an effect
under consideration. $\rho := 0$ if $X_i$ is not a cause.

Responsibility ranges between 0 and 1. Non-zero
responsibility ($\rho > 0$) means that the variable is a functional cause,
$\rho = 1$ means it is also a counterfactual cause.
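Definitions 2.11 and 2.13 can be combined into a brute-force check for small networks. The sketch below is a minimal illustration under stated assumptions: inputs are deterministic 0/1 values, so $\Delta P$ reduces to a difference of formula values, and the dissociation expression and its folding are supplied by hand rather than derived from the network. All function and variable names are hypothetical. Supports are enumerated by increasing size, so the search is exponential in the number of inputs.

```python
from itertools import chain, combinations


def subsets(items):
    """All subsets of `items` as tuples, smallest first."""
    items = list(items)
    return chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))


def delta(formula, assignment, names):
    """Delta P on 0/1 inputs: value at the actual assignment minus the
    value after inverting the variables in `names`."""
    flipped = {k: (1 - v if k in names else v) for k, v in assignment.items()}
    return formula(assignment) - formula(flipped)


def responsibility(phi, psi, fold, x0, xi):
    """Return 1/(|S|+1) for a minimum support S, or 0 if xi is not a functional cause.

    phi:  formula over the original inputs (dict -> 0/1)
    psi:  its dissociation expression w.r.t. xi (dict -> 0/1)
    fold: maps each input of psi to the original input it replicates
    x0:   actual assignment of the original inputs
    """
    xt0 = {t: x0[x] for t, x in fold.items()}  # actual assignment of psi's inputs
    others = [x for x in x0 if x != xi]
    for support in sorted(subsets(others), key=len):
        if delta(phi, x0, set(support) | {xi}) == 0:
            continue  # FC2(a) fails for this support
        unfolded = [t for t, x in fold.items() if x in support]  # [S]_t
        if all(delta(psi, xt0, set(s)) == 0 for s in subsets(unfolded)):
            return 1 / (len(support) + 1)  # FC2(b) holds
    return 0


x0 = {"A": 1, "B": 1}
identity = {"A": "A", "B": "B"}

def phi_a(v):   # model a: Y = A or B (a tree, so it is its own DE)
    return v["A"] | v["B"]

def phi_b(v):   # model b: Y = A or (not A and B)
    return v["A"] | ((1 - v["A"]) & v["B"])

def psi_b(v):   # DE of model b with respect to B
    return v["A1"] | ((1 - v["A2"]) & v["B"])

fold_b = {"A1": "A", "A2": "A", "B": "B"}

rho_b_in_a = responsibility(phi_a, phi_a, identity, x0, "B")
rho_b_in_b = responsibility(phi_b, psi_b, fold_b, x0, "B")
rho_a_in_b = responsibility(phi_b, phi_b, identity, x0, "A")
print(rho_b_in_a, rho_b_in_b, rho_a_in_b)
```

For model a, Bob's throw gets responsibility 1/2 with support {A}; for model b it gets 0, because flipping $A_1$ alone already changes the value of the dissociation expression, as in Example 2.12.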

3. FORMAL PROPERTIES 

Functional causality encodes the semantics of causal struc- 
tures with the help of potential functions which are depen- 
dent only on the input variables. In this section we demon- 
strate that reasoning in terms of functional causality pro- 
vides a more powerful and robust way to reason about causes 
than actual causality. In addition, we give a transitivity re- 
sult and use it to derive complexity results for certain types 
of causal network structures. 

3.1 CC ⊂ FC ⊂ AC

Functional causes are a refined notion of actual causes. 
Even though the definition of AC does not exclude depen- 
dent variables, functional causality does not consider them 
as possible causes, as their value is fully determined from 
the input variables. The relationship of functional causality 
of input variables to actual and counterfactual causality is 
demonstrated in the following theorem. 



Theorem 3.1 (CC-FC-AC Relationship). Every
$X = x^0$ that is a counterfactual cause is also a functional cause,
and every $X = x^0$ that is a functional cause is also an actual
cause.

As we have seen with the Shock C example (Example 2.6),
the HP definition of actual causes is too permissive and
determines variables to be causes which should intuitively not
be such. The definition of functional causality fixes these
problems. Appendix B.1 gives a detailed treatment of the
Shock C example, both from FC and AC perspectives, and
also provides insight into the problems of actual causality.

3.2 Causal Network Expansion 

Functional, as well as actual causes, rely on the causal 
network to model a given problem. The two different models 
of the thrower example displayed in Fig. 2 demonstrate that
changes in the network structure can help model priorities 
of events, which in turn can redefine causality of variables. 

[Figure 6: two causal networks with input B = 1 and intermediate node Y_1; in panel (a) the effect is Y = Y_1 ∨ B, in panel (b) an extra node Y_2 = B is added and Y = Y_1 ∨ Y_2.]

Figure 6: Expansion can cause problems for the HP
definition: Introducing node Y_2 in (b), which merely
repeats the value of B, does not change function
Y(X), but makes A an actual cause.

In Example 2.5, $B$ is removed as a cause by the addition
of an intermediate node in the causal network structure that
models the preemption of the effect by node $A$ (Alice's rock
is the one that breaks the bottle). This change is also visible
in the causal Boolean formula, which is transformed from
$\Phi = A \lor B$ to $\Phi_1 = A \lor (\neg A \land B)$. As we know from Boolean
algebra, the two formulas are equivalent as they have the
same truth tables. However, they are not causally equivalent,
as they yield different causality results.
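The logical equivalence claimed here is easy to confirm by enumerating the truth table. The sketch below compares the formula of the simple network with that of the preemption network; the function names are hypothetical, and the check says nothing about causal equivalence, which depends on the form of the expression rather than on its truth table.

```python
from itertools import product


def phi(a, b):
    # Simple network: A or B.
    return a or b


def phi_1(a, b):
    # Preemption network: A or (not A and B).
    return a or ((not a) and b)


truth_table = [
    (a, b, bool(phi(a, b)), bool(phi_1(a, b)))
    for a, b in product((False, True), repeat=2)
]
equivalent = all(left == right for _, _, left, right in truth_table)
print(equivalent)
```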

Therefore, the grammatical form of the Boolean expres- 
sion is important in determining causality, and the fun- 
ctional definition captures that through dissociation expres- 
sions. It is important to understand how changes in the 
causal network affect causality, and whether we can state 
meaningful properties for those changes. 

We define causal network expansion in a standard way by
the addition of nodes and/or edges to the causal structure.
A network $CN_e$ with formula $\Phi_e$ is a node expansion
(respectively edge expansion) of $CN$ with formula $\Phi$ if it can
be created by the addition of a node (respectively edge) to
$CN$, while $\Phi_e \equiv \Phi$. $CN_e$ is a single-step expansion if it is
either a node or an edge expansion of $CN$.

Definition 3.2 (Expansion). A network $CN_e$ is an
expansion of network $CN$ iff there exists a set $\{CN_1, CN_2, \ldots, CN_k\}$ with
$CN_1 = CN$ and $CN_k = CN_e$, such that $CN_{i+1}$ is a single-step
expansion of $CN_i$, $\forall i \in [1, k-1]$.

Networks represented by the formulas $\Phi_1 = A \lor (\neg A \land B)$ and
$\Phi_2 = (A \land \neg B) \lor B$ are both expansions of $\Phi = A \lor B$, but
note that $\Phi_1$ and $\Phi_2$ are not expansions of one another.

As shown by the thrower example, network expansion can 
remove causes. As the following theorem states, it can only 
remove, not add causes. 

Theorem 3.3. If $CN_e$ with formula $\Phi_e$ is an expansion
of $CN$ with formula $\Phi$ and $X_i = x_i^0$ is a cause in $\phi_e$, then
$X_i = x_i^0$ is also a cause in $\phi$.

Specifically in the case where no negation of literals is
allowed, changes to the structure do not affect the causality
result.

Theorem 3.4. If $CN_e$ with formula $\Phi_e$ is an expansion of
$CN$ with formula $\Phi$ that does not contain negated variables,
then $\phi$ and $\phi_e$ have the same causes.

The properties of formula expansion are important, as
they prevent unpredictability due to causal structure changes.
Note that the Halpern and Pearl definition does not handle
formula expansion as gracefully. Figure 6 demonstrates with
an example that the HP definition allows introducing new
causes with expansion. $A = 1$ is not a cause in the simple
network of Fig. 6a, but becomes causal after adding node $Y_2$
in Fig. 6b. Therefore, network expansion is unpredictable
for actual causes, as there are examples where it can either
remove (Fig. 2) or introduce new causes (Fig. 6). This is a
strong point for our definition, as causality is tied to the
network structure, and erratic behavior due to minor structure
changes, as is the case in this example, is troubling.

3.3 Functional causes and transitivity 

Functional causality only considers input nodes in the
causal network as permissible causes for events.¹ Under this
premise, the notion of transitivity of causality is not
well-defined, since dependent variables (such as $B$ in the Shock
C Example 2.6) are never considered permissible causes of
events in their descendants. In order to ask the question of
transitivity, we allow a dependent variable $Y_1$ to become a
possible cause in a modified causal model $M'$ with $Y_1$ as
additional input variable. We achieve this with the help of an
external intervention $[Y_1 \leftarrow y_1^0]$, setting the variable to its
actual value $y_1^0$. The new model is then $M' = (N, \mathcal{F}')$ with
modified structural equations $\mathcal{F}' = \mathcal{F} \setminus \{F_{Y_1}\} \cup \{F'_{Y_1}\}$, where
$F'_{Y_1} = y_1^0$, and hence new input variables $X' = (X, Y_1)$ with
original assignment $x'^0 = (x^0, y_1^0)$.

We can now ask the question of transitivity as follows:
Assume that an assignment $X = x^0$ is a cause of $Y_1 = y_1^0$
in a causal model $M$. Further assume that $Y_1 = y_1^0$ is a
cause of $Y_2 = y_2^0$ in the modified network $[Y_1 \leftarrow y_1^0]$. Is
then $X = x^0$ a cause of $Y_2 = y_2^0$ in the original network
$M$? In agreement with recent prevalent (yet not undisputed)
opinion in causality literature, functional causality
is not transitive, in general.



Corollary 3.5 (Non-transitivity). Functional causality
is not transitive, in general.



Consider again the Shock C Example 2.6. A = 1 is a functional
cause of B = 1, and B = 1 is a functional cause of C = 1
in the modified model $[B \leftarrow 1]$. However, A = 1 is not a
functional cause of C = 1 (see Sect. B.1 for details).

Intransitivity of causality is not uncontroversial [19], and
humans generally feel a strong intuition that causality should
be transitive. It turns out that functional causality is
actually transitive in an important type of network structure
that relates to this intuition: transitivity holds if there is
no causal connection between the original cause ($X$) and
the effect ($Y_2$) except through the intermediate node ($Y_1$).
This property allows us to deduce a lower complexity for
determining causality in restricted settings in Sect. 3.4.

Definition 3.6 (Markovian). A node $N$ is Markovian
in a causal network $CN$ iff there is no path from any
ancestor of $N$ to any descendant of $N$ that does not pass
through $N$.
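
The Markovian condition can be tested directly on the graph of a causal network. The sketch below is illustrative only: it assumes the network is given as a DAG in a dictionary mapping each node to its children (a representation chosen here, not taken from the paper), and the function names are hypothetical. A node is Markovian if, once it is blocked, none of its ancestors can still reach any of its descendants.

```python
def reachable(graph, start, blocked=None):
    """Nodes reachable from start by directed edges, never entering blocked."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for child in graph.get(node, ()):
            if child != blocked and child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def is_markovian(graph, n):
    """True iff every path from an ancestor of n to a descendant of n passes through n."""
    descendants = reachable(graph, n)
    ancestors = {a for a in graph if n in reachable(graph, a)}
    return all(not (reachable(graph, a, blocked=n) & descendants)
               for a in ancestors)

# Shock C structure (B = A, C = (A = B)): A also feeds C directly.
shock_c = {"A": ["B", "C"], "B": ["C"], "C": []}
# A simple chain A -> B -> C.
chain = {"A": ["B"], "B": ["C"], "C": []}

if __name__ == "__main__":
    print(is_markovian(shock_c, "B"), is_markovian(chain, "B"))
```

In the Shock C structure, B is not Markovian because A reaches C without passing through B, which matches the setting in which transitivity fails; in the chain, B is Markovian.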

Proposition 3.7 (Markovian transitivity). Given
a causal model $M$ in which $X = x^0$ is a cause of $Y_1 = y_1^0$ with
responsibility $\rho_1$, and in which $Y_1$ is Markovian. Further
assume that $Y_1 = y_1^0$ is a cause of $Y_2 = y_2^0$ with responsibility $\rho_2$
in the modified causal model $[Y_1 \leftarrow y_1^0]$. Then $X = x^0$ is a
cause of $Y_2 = y_2^0$ in $M$ with responsibility



f>2 



¹ This restriction avoids dealing with problematic, inconsistent
assignments of variables, which turns out to be one principal
reason why the HP definition gives counter-intuitive
results. See Appendix B.1 for a detailed discussion.

3.4 Complexity

Analogous to Eiter and Lukasiewicz's result that deter- 
mining actual causes for Boolean variables is NP-hard [9], 
determining functional causality is also NP-hard, in general. 

Theorem 3.8 (Hardness). Given a Boolean formula
$\Phi$ on causal network $CN$ and assignment $x^0$ of the input
variables, determining whether $X_i = x_i^0$ is a cause of
$\varphi = \Phi(x^0)$ is NP-hard.



Even though determining functional causality is hard, there 
are important cases that can be solved in polynomial time. 

Trees. If the causal network is a tree, then the dissociation
network is the same as the causal network and there is a
single potential function. Determining causality on a tree
can be simplified, as a result of the Markovian transitivity
property (Prop. 3.7) and the fact that all nodes in a tree are
Markovian.

Lemma 3.9 (Causality in Trees). If $X_i = x_i^0$ is a
cause of the output node $Y$ in a tree causal network, and
$p = \{X_i, Y_1, Y_2, \ldots, Y\}$ the unique path from $X_i$ to $Y$, then
every node in $p$ is a functional cause of all of its descendants
in $p$. Consequently, $X_i$ is a cause of all $Y_j \in p$.



Following from Lemma 3.9, causality in cases of tree-shaped
causal structures with bounded arity (number of parents
per node) is decidable in polynomial time.

Theorem 3.10 (Trees with arity $\leq k$). Given a tree-shaped
causal network with formula $\Phi$ and bounded arity
and actual assignment $x^0$ of the input variables, determining
whether $X_i = x_i^0$ is a cause of $\varphi = \Phi(x^0)$ is in P.



An even stronger result is given by Theorem 3.11, which covers
the case of causal structures where the function at every
node is a primitive Boolean operator (AND, OR, NOT), without
any restrictions on the arity.

Theorem 3.11 (Trees with Primitive Operators).
Given a tree causal network with formula $\Phi$ where the function
of every node is a primitive Boolean operator, i.e. AND,
OR, NOT, and assignment $x^0$ of the input variables, determining
whether $X_i = x_i^0$ is a cause of $\varphi = \Phi(x^0)$ is in P.

As demonstrated by Olteanu and Huang in [25], the lineage
expressions of safe queries do not have repeated tuples.
Lineage expressions for conjunctive queries with no repeated
tuples correspond to causal networks that are trees. Following
directly from Theorem 3.11, we get complexity results
for safe queries.

Corollary 3.12 (Causes of Safe Queries). De- 
termining the causes of safe queries can be done in poly- 
nomial time. 

In these tractable cases, due to the transitivity property,
responsibility can also be computed in polynomial time, using
the formula of Prop. 3.7.

Positive DNF and CNF. Another important category of 
tractable networks are those that correspond to DNF and 
CNF formulas with no negated literals. This category covers 
important cases of join queries in a database context. 



N (News feeds)

nid  story                                             source
1    ... schools celebrate Indiana's birthday ...      IndyStar
2    ... economic downturn affected sensitive ...      NYTimes
3    ... with sequences shot in Indianapolis ...       IndyStar
4    ... House Approves Bill That Would Ease ...       NYTimes
5    ... new Bill approved yesterday ...               IndyStar
6    ... PODS held in Indianapolis this year ...       NYTimes
7    ... discussed in a recent talk the options to ... NYTimes
8    ... Swine Flu Death Toll at 10,000 Since ...      NYTimes
9    ... Indianapolis welcomes SIGMOD/PODS ...         IndyStar

F (Filtered feed)

story
... schools celebrate Indiana's birthday ...
... economic downturn affected sensitive ...
... with sequences shot in Indianapolis ...
... House Approves Bill That Would Ease ...
... PODS held in Indianapolis this year ...
... discussed in a recent talk the options to ...
... Swine Flu Death Toll at 10,000 Since ...

Figure 7: News feed with aggregated data from different
sources (above), the filtered feed (below).



Theorem 3.13 (Positive DNF). Given a positive DNF
formula $\Phi$ and assignment $x^0$ of the input variables, determining
whether $X_i = x_i^0$ is a cause of $\varphi = \Phi(x^0)$ is in
PTIME.

Theorem 3.14 (Positive CNF). Given a positive CNF
formula $\Phi$ and assignment $x^0$ of the input variables, determining
whether $X_i = x_i^0$ is a cause of $\varphi = \Phi(x^0)$ is in
PTIME.



4. EXPLAINING QUERY RESULTS 

In this section, we show how causality can be applied to 
address examples from the database literature, like prove- 
nance and "Why Not?" queries, as well as examples show- 
casing causality of aggregates. We also demonstrate how our 
causality framework can model different types of elements 
that can be considered contributory to a query result, like 
query operations instead of tuples. 

4.1 Why So? and Why No? 



We revisit our motivating example (Example 1.1), but
introduce a slight variation that aggregates data from different
news sources to demonstrate how functional causality can be
used to answer Why So? and Why No? questions.

Example 4.1 (News aggregator). A user has access
to the News feed relation N, depicted in Fig. 7. N contains
news articles from two different sources, the NY Times
and the local IndyStar. The user likes to read the local news
from IndyStar, but she prefers the NY Times with regard to
broader US or world news. Hence, she does not want to read
on topics from IndyStar that are also covered by NY Times.
Her filtered feed is constructed by the query

    select N.story
    from   N
    where  N.source = 'NYTimes'
       or  not exists (
             select *
             from   N as N1
             where  topic(N1.story) = topic(N.story)
               and  N1.source = 'NYTimes')

where topic() is a topic extractor modeled as a user-defined
function. The user's filtered feed will contain stories from
NY Times, and only those stories from IndyStar that NY
Times does not cover. Simply, if $S_{NY}$ is an article in NY
Times covering a topic, and $S_I$ an article in IndyStar about
the same topic, whether the user will see this topic in her
feed or not follows a causal model similar to that of Fig. 5a
with Boolean formula $\Phi = S_{NY} \vee (\bar{S}_{NY} \wedge S_I)$. The topic
appears in F if it appears in either NY Times or IndyStar,
but the first gets priority.

When asking what is the cause of getting an article on
Indiana's birthday, the user gets tuple 1 from relation N, as
it is counterfactual. When asking what is the cause of seeing
an article on PODS, she gets the NY Times article (tuple
6), even though IndyStar also had a story about it (tuple 9).
The analysis is equivalent to the rock thrower example.
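
The behavior of the filter query on the tuples of Fig. 7 can be reproduced with a small in-memory sketch. The topic labels below are hypothetical stand-ins for the topic() extractor, and the helper names are illustrative only. The sketch tests plain counterfactual dependence (does removing one tuple change whether a topic is shown?), not the full functional-cause definition.

```python
# Rows of relation N from Fig. 7: (nid, story, source, topic).
# The topic field is an assumed output of the topic() extractor.
NEWS = [
    (1, "schools celebrate Indiana's birthday", "IndyStar", "indiana_birthday"),
    (2, "economic downturn affected sensitive", "NYTimes", "economy"),
    (3, "with sequences shot in Indianapolis", "IndyStar", "movies"),
    (4, "House Approves Bill That Would Ease", "NYTimes", "bill"),
    (5, "new Bill approved yesterday", "IndyStar", "bill"),
    (6, "PODS held in Indianapolis this year", "NYTimes", "pods"),
    (7, "discussed in a recent talk the options to", "NYTimes", "talk"),
    (8, "Swine Flu Death Toll at 10,000 Since", "NYTimes", "swine_flu"),
    (9, "Indianapolis welcomes SIGMOD/PODS", "IndyStar", "pods"),
]

def filtered_feed(news):
    """nids kept by the query: NYTimes stories, plus stories on topics NYTimes does not cover."""
    nyt_topics = {topic for _, _, source, topic in news if source == "NYTimes"}
    return [nid for nid, _, source, topic in news
            if source == "NYTimes" or topic not in nyt_topics]

def topic_visible(news, topic):
    """True iff some story on the topic survives the filter."""
    shown = set(filtered_feed(news))
    return any(nid in shown and t == topic for nid, _, _, t in news)

def counterfactual_tuples(news, topic):
    """nids whose removal alone changes whether the topic is visible."""
    before = topic_visible(news, topic)
    return [row[0] for row in news
            if topic_visible([r for r in news if r is not row], topic) != before]

if __name__ == "__main__":
    print(filtered_feed(NEWS))
    print(counterfactual_tuples(NEWS, "indiana_birthday"))
    print(counterfactual_tuples(NEWS, "pods"))
```

Under these assumed topics, tuple 1 is counterfactual for the Indiana's birthday topic, while no single tuple is counterfactual for PODS: removing tuple 6 lets tuple 9 through. This is the preemption pattern of the rock thrower example, which counterfactual dependence alone cannot resolve.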

The framework can be used in a similar fashion to respond
to "Why No?" questions. Assume tuple $t_{10}$ = (10, '... immigration
officials arrest 300 ...', NYTimes), which was
present in yesterday's news feed, but has since been
removed. Tuple $t_{10}$ is a functional cause to the Why No?
question: "Why do I not see news on immigration", as it is
counterfactual. Its removal from the feed caused the absence
of immigration topics in the user's filtered view.

4.2 Aggregates 

We next show how functional causality can be applied to
determine causes and responsibility for aggregates. We focus
here only on positive integers and give complexity results for
Why so? and Why no? for Why is SUM $\geq c$? and Why
is SUM $\not\geq c$?.

Notation. Let $\Omega \in \{$SUM, MAX, AVG, MIN, COUNT$\}$ be an
aggregate function $\Omega(V)$ evaluated over a multiset of values $V$
from the domain of positive integers, i.e. $v_i \in \mathbb{N}$. Consider
a view $R$ with a certain attribute $A$ over which we evaluate
the aggregate function. Let $T$ be a tuple universe under
consideration (i.e. a set of tuples which we consider possible
or, simply, the cross product of the active domains for each
attribute in $R$), $T^+ \subseteq T$ the subset of tuples that is in $R$ (i.e.
that is true under current assignment) and $T^- = T - T^+$
be those tuples from the tuple universe which are missing
(i.e. which are false under current assignment). Denote $X$ the
vector of Boolean variables where $X_i$ is true or false
depending on whether the corresponding tuple $t_i \in T$ is in $T^+$
or not. We write $\Omega(X)$ as notational shortcut for $\Omega$
evaluated over the subset $V^+ \subseteq V$ for which the corresponding
Boolean value is true: $V^+ = \{v_i \mid v_i \in V \wedge X_i = 1\}$. For
example, SUM($x^0$) can stand for the query select SUM(R.A)
from R if $R$ contains tuples with values from $V$ in the
attribute R.A. Let op $\in \{>, \geq, <, \leq, =, \neq\}$. An aggregate
condition $\omega^0$ op $c$ for a given constant $c$ is a Boolean expression
that is true or false for given assignment $x^0$.

Definition 4.2 (Why so? and Why no?). Let $\omega^0 =
\Omega(x^0)$ be the value of an aggregate function for current
assignment $x^0$. The question of Why so? (respectively, Why
no?) for a condition $\omega^0$ op $c$ that is true (respectively, false)
under the current assignment corresponds to the question of
which set of tuples $\{t_i\}$ from the tuple universe with original
assignment $x_i^0 = 1$ (respectively, 0) is a cause of the event
($\omega$ op $c$ = true) (respectively, false) with responsibility
$\rho_i$.

Figure 8: Sum example. (a): Relation R with tuples
from tuple domain T. (b): Responsibility $\rho_i$ of $t_i$ for
Why so? (SUM($x^0$) $\geq c$) and Why no? (SUM($x^0$) $\not\geq c$).

Example 4.3 (Sum example). Consider a tuple universe
T = [(10), (20), (30), (50), (100)] and a view R(A) with
the subset of tuples R = {(20), (30), (100)}. Now consider
the query select SUM(R.A) from R executed over the view
R which returns 150. In our notation, this is represented
with a vector V = [10,20,30,50,100], current assignment
$x^0$ = [0,1,1,0,1], and SUM($x^0$) = 150 (see Fig. 8a).

Why SUM $\geq c$?: $t_3$ is a cause of SUM($x^0$) $\geq$ 30 with
responsibility 1/2. FC2(a): SUM($x^1$) $\not\geq$ 30 for $x^1$ = [0,1,0,0,0].
FC2(b): SUM($x^{1*}$) $\geq$ 30 for every assignment $x^{1*}$ with $x_3^{1*}$ =
1 and any subset of $\{x_j^1 = 0\}$ inverted to its original assignment.
In contrast, $t_2$ is not a cause: While FC2(a) holds for
$x^1$ = [0,0,0,1,0] with SUM($x^1$) $\not\geq$ 30 (and then $t_2$ would be
counterfactual), FC2(b) is not fulfilled for $x^{1*}$ = [0,1,0,0,0].

Why SUM $\not\geq c$?: $t_4$ is a cause of (SUM($x^0$) $\geq$ 180) = false,
as both $X_4$ and the condition are false under current assignment,
but would hold for $x^1$ = [0,1,1,1,1].

Figure 8b shows responsibility for different values of
constant $c$ in Example 4.3 and illustrates that responsibility for
SUM is not monotone. In order to compute responsibility
for a tuple $t_i$, one must find the smallest set of tuples that,
when inverted (i.e. either inserted or deleted), make tuple
$t_i$ counterfactual for the condition. We next give complexity
results for the SUM aggregator and show that evaluating
causality for SUM $\geq c$ is already hard for one relation.
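
The search for a smallest support can be sketched by brute force on the numbers of Example 4.3. This is a minimal sketch under stated assumptions, not the pseudo-polynomial algorithm referred to in this section: it assumes the condition is the non-strict comparison SUM >= c, that responsibility equals 1/(1+|S|) for a smallest support S, and that FC2(b) requires the condition to keep its original truth value whenever any subset of S is inverted while the tuple itself keeps its original value. All function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def agg_sum(values, x):
    """SUM over the values whose Boolean indicator is 1."""
    return sum(v for v, bit in zip(values, x) if bit)

def invert(x, idxs):
    """Copy of assignment x with the given positions flipped."""
    return [1 - b if j in idxs else b for j, b in enumerate(x)]

def responsibility(values, x0, i, c):
    """Responsibility of tuple i for the truth value of SUM >= c, or None if not a cause."""
    def holds(x):
        return agg_sum(values, x) >= c
    actual = holds(x0)
    others = [j for j in range(len(x0)) if j != i]
    for size in range(len(others) + 1):          # smallest supports first
        for support in combinations(others, size):
            # Assumed FC2(a): flipping tuple i together with the support changes the condition.
            if holds(invert(x0, set(support) | {i})) == actual:
                continue
            # Assumed FC2(b): with tuple i unchanged, flipping any subset of the
            # support leaves the condition at its original truth value.
            if all(holds(invert(x0, set(sub))) == actual
                   for k in range(size + 1)
                   for sub in combinations(support, k)):
                return Fraction(1, 1 + size)
    return None

V = [10, 20, 30, 50, 100]
X0 = [0, 1, 1, 0, 1]

if __name__ == "__main__":
    print(responsibility(V, X0, 2, 30))   # tuple t3, condition SUM >= 30
    print(responsibility(V, X0, 1, 30))   # tuple t2, condition SUM >= 30
    print(responsibility(V, X0, 3, 180))  # tuple t4, condition SUM >= 180
```

Under these assumptions the sketch agrees with the example: t3 is a cause of SUM >= 30 with a support of size one, t2 is rejected by the FC2(b) check, and t4 is counterfactual for SUM >= 180 being false. The enumeration is exponential in the number of tuples, so it is only meant for small illustrations.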

Lemma 4.4 (Sum possible causes). If a tuple $t_i$ is a
cause to a Why SUM $\geq c$? (respectively, Why SUM $\not\geq c$?)
question, then $t_i$ is true (respectively, false) under the
actual assignment.

Proposition 4.5 (Why so? = Why no?). Answers to
the question Why SUM $\geq c$? for an aggregate condition
(SUM $\geq c$) = true are the same as Why SUM $\not\geq c$? for its
inverse (SUM $< c$) = false.

Theorem 4.6 (Sum hardness). Determining Why SUM
$\geq c$? is NP-complete even for one single input relation.

Theorem 4.7 (Sum PSEUDO-PTIME). Determining
responsibility of a tuple with value $v$ for (SUM($x^0$) $\geq c$) = true
for one single input relation allows a pseudo-polynomial time
algorithm $O(n(\omega^0 - c + v))$ where $\omega^0$ = SUM($x^0$).

Example 4.8 (News Feed continued). We will now
revisit our motivating example (Example 1.1). The user
may be surprised by the increased occurrence of Indianapolis
in her personalized feed (5 in total) during a certain week,
which is a deviation from the norm. The user can ask a
causality query, "Why are there more than 3 occurrences of
Indianapolis?". This is a Why So? query about the COUNT
on a join between two tables (N and K). The system can
calculate the responsibilities of the user's keywords for this
aggregate being more than expected. In this case, the
responsibilities for the keywords DB_conf, Purdue and Movies are
1, 1/2 and 1/2 respectively. This is because DB_conf is a
counterfactual cause of COUNT > 3, while the others are causes
with support of size 1. This result is intuitive, as there are more
articles with the DB_conf tag (SIGMOD/PODS happening
in Indianapolis) than stories with tags Movies or Purdue.

Similarly, a user may have actually expected to see more
news about Indianapolis than the ones she's getting: "Why
aren't there more than 6 stories on Indianapolis?". The
system can identify the keyword Indy_500 as a cause, as it is
counterfactual: adding it to the user's keyword list makes
the COUNT more than 5. Presented with that causality result,
the user may decide to include the new keyword in her feed.

Figure 9: Books in "Ye Olde Booke Shoppe" [4].

4.3 Causes beyond tuples 

Provenance and non-answers commonly focus on tuples
as discrete units that contribute to a query result.
Our causality framework is not restricted to tuples, but can
model any element that could be considered contributory to
a result. To showcase this flexibility, we pick an example
from Chapman and Jagadish [4] that models operations in
workflows as possible answers to "Why not?" questions.

Example 4.9 (Book Shopper [4, Ex. 1]). A shopper
knows that all "window display books" at Ye Olde Booke
Shoppe are around $20, and wishes to make a cheap purchase.
She issues the query: Show me all window books.
Suppose the result from this query is (Euripides, "Medea").
Why is (Hrotsvit, "Basilius") not in the result set? Is it not a
book in the book store? Does it cost more than $20? Is there
a bug in the query-database interface such that the query was
not correctly translated?



[Figure 10 workflow: Ye Olde Books (workflow input) -> Manipulation 1:
Select Books <= $20 -> Manipulation 2: Apply Season Criteria ->
Window Books (workflow output)]

Figure 10: Variation of the query workflow from [4].

Chapman and Jagadish consider a discrete component of a
workflow, called manipulation, as an explanation of a "Why
not?" query. The workflow describing the query of the
example is shown in Fig. 10. Roughly, a manipulation is
considered picky for a non-result if it prunes the tuple. For
example, manipulation 1 of Fig. 10 is picky for "Odyssey",
as it costs more than $20. Equivalently, a manipulation is
frontier picky for a set of non-results, if it is the last in the
workflow to reject tuples from the set. In this framework,
the cause of a non-answer will be a frontier picky manipulation.

[Figure 11 node equations, panel (a): Y1 = (not M1) or M2, Y = M1 and Y1]

Figure 11: The causal network of Example 4.9 (a),
and its DN with respect to M2 (b).

In Example 4.9 tuple t = (Hrotsvit, "Basilius") passes the
price test, but is cut by manipulation 2 as it doesn't satisfy
the seasonal criteria. The causal network representing this
example is presented in Fig. 11a. Input nodes model the
events: $M_1$: manipulation 1 is not potentially picky with
respect to t, and $M_2$: manipulation 2 is not potentially picky
with respect to t. At the end, the tuple appears only if neither
manipulation is picky: $M_1 \wedge M_2$. Intermediate node $Y_1$
encodes the precedence of the manipulations in the workflow.
A tuple will be stopped at point $Y_1$ of the workflow
if $M_2$ is picky but $M_1$ was not: $M_1 \wedge \bar{M}_2$. It will pass this
point if the opposite holds, so $Y_1 = \overline{M_1 \wedge \bar{M}_2} = \bar{M}_1 \vee M_2$,
and $Y = M_1 \wedge Y_1$.
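
The two-node network can be checked by enumerating its four input assignments. The sketch below assumes the negation bars lost in the extracted equations are placed so that Y1 = (not M1) or M2 and Y = M1 and Y1, which is the reading consistent with the statement that the tuple appears only if neither manipulation is picky. The function names are illustrative.

```python
from itertools import product

def y1(m1, m2):
    # Assumed equation: Y1 = not (M1 and not M2) = (not M1) or M2
    return (not m1) or m2

def y(m1, m2):
    # Output node: Y = M1 and Y1
    return m1 and y1(m1, m2)

def truth_table():
    """Rows (M1, M2, Y) for all input assignments."""
    return [(m1, m2, bool(y(m1, m2)))
            for m1, m2 in product([False, True], repeat=2)]

if __name__ == "__main__":
    for m1, m2, out in truth_table():
        print(int(m1), int(m2), int(out))
```

The table coincides with M1 and M2. Starting from M1 = 1, M2 = 0, inverting M2 alone makes the tuple appear; starting from M1 = 0, M2 = 0, no single inversion does, and both inputs have to change.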

Applying the FC framework for $M_1 = 1$ ($M_1$ is not picky),
and $M_2 = 0$ ($M_2$ is picky), correctly yields that $M_2$ is the
only cause: $S = \emptyset$, $\Delta P_\Phi(M_2) \neq 0$. If both manipulations
were potentially picky ($M_1 = 0$ and $M_2 = 0$), the FC definition
again correctly picks $M_1$ as the only cause with support
$S = \{M_2\}$ (even though $M_2$ is potentially picky, the tuple
never gets to it), which agrees with the Why not? framework
that selects as explanation the last manipulation that
rejected the tuple.

5. RELATED WORK 

Our work relates to and unifies ideas from three
main areas: research on causality, provenance, and missing
query result explanations.

Causality. Causality is an active research area mainly 
in logic and philosophy with their own dedicated workshops 
(see e.g. VU). The most prevalent definitions of causality are 
based on the idea of counterfactual causes, i.e. causes are
explained in terms of counterfactual conditionals of the form
If X had not occurred, Y would not have occurred. This
idea of counterfactual causality can be traced back to Hume
[22]. The best known counterfactual analysis of causation in
modern times is due to Lewis [18]. In a database setting,
Miklau and Suciu [23] define critical tuples as those which
can become counterfactual under some value assignment of
variables. Halpern and Pearl [13] (HP in short) define a
variation they call actual causality. Roughly speaking, the
idea is that X is a cause of Y if Y counterfactually depends
on X under "some" permissive contingency, where "some" is
elaborately defined. Later, Chockler and Halpern [6] define
the degree of responsibility as a gradual way to assign
causality. Eiter and Lukasiewicz [9] show that the problem of
detecting whether $X = x^0$ is an actual cause of an event
is $\Sigma_2^P$-complete for general acyclic models and NP-complete
for binary acyclic models. They also give an alleged proof
showing that actual causality is always reducible to primitive
events. However, Halpern [12] later gives an example
for non-primitive actual causes, showing this proof to ignore
some cases under the original definition. Chockler et al. [7]
later apply causality and responsibility to binary Boolean
networks, giving a modified definition of cause which, as we
show in Sect. B.2, introduces new counter-intuitive problems
and, despite claims to the contrary, is not equal to the
original HP definition of actual cause.

Our definition of functional cause builds upon the HP
definition, but extends it with several desirable properties:
causes are always primitive input variables, network expansion
cannot create new causes, and the definition fixes intuitive
examples where the HP definition does not follow
consensus in the causality literature. It is these properties
that allow us to apply our causality framework to a database
setting in Sect. 4.

Provenance. Approaches for defining data provenance
can be mainly divided into three categories: how, why, and
where provenance [3, 5, 8, 10]. In particular for the "why so"
case, we observe a close connection between provenance and
causality, where it is often the case that tuples in the
provenance of a positive query result are causes.
While none of the work on provenance mentions or makes
direct connections to causality, those connections can be
found. The work by Buneman et al. [5] makes a distinction
between why and where provenance that can be connected
to causality as follows: why provenance returns all
tuples that can be considered causes for a particular result,
and where provenance returns attributes along a particular
causal path. Green et al. [10] present a generalization for all
types of provenance as semirings; finding functional causes
in a Boolean tree, if taken in a provenance context, yields
degree-one polynomials for provenance semirings. View data
lineage, as presented by Cui et al. [8], also addresses
aggregates but lacks a notion of graded contribution and returns
all tuples that contribute to an aggregate.

In contrast, our approach can rank tuples according to
their responsibility, and hence determines a graded
contribution with counterfactual tuples ranked
first. Also, in contrast to our paper, most of the work on
provenance has little or no connection to the philosophical
groundwork on causality. We take this work and significantly
adapt it so that it can be applied to databases.

Missing query results. Very recent work has focused
on the question "why no", i.e. why is a certain tuple not
in the result set? The work by Huang et al. [17] presents
provenance for potential answers and never answers. In the
case that no insertions or modifications can yield the desired
result - usually for privacy or security reasons - the system
declares that particular tuple a never answer. Both Huang's
work and Artemis handle potential answers by providing
tuple insertions or modifications that would yield the
missing tuples. Alternatively, Chapman and Jagadish [4] focus
on which manipulation in the query plan eliminated a
specific tuple. Lim et al. [20] adopt a third, explanation-based,
approach. This approach aims to answer questions
such as why, why not, how to, and what if for context-aware
applications, but does not address a database setting.

Our work unifies the above approaches in the sense that
we model both tuples and manipulations as possible causes
for missing query answers. Also, our approach unifies the
problem of explaining missing query answers (why is a tuple
not in the query result) with work on provenance (why is a
tuple in the query result).

Other. Minsky and Papert initiated the study of the
computational properties of Boolean functions using their
representation by polynomials and call this the arithmetic
instead of the logical form [24, p. 27]. This method was later
successfully used in complexity theory and became known
as arithmetization [2].

6. CONCLUSIONS AND FUTURE WORK 

In this paper, we defined functional causes, a rigorous 
and extensible definition of causality encoding the seman- 
tics of causal structures with the help of powerful potential 
functions. Through theoretical analysis of its properties, we 
demonstrated that our definition provides a more powerful 
and robust way to reason about causes than other estab- 
lished notions of causality. Albeit NP-hard in the general 
case, common categories of causal networks that correspond 
to interesting database examples (e.g. safe queries) prove to 
be tractable. We presented several database examples that 
portrayed the applicability of our framework in the context 
of provenance, explanation of non-answers, as well as aggre- 
gates. We demonstrated how to determine causes of query 
results for SUM and COUNT aggregates, and how these can be 
ranked according to the causality metric of responsibility. 

Overall, with this work we establish the theoretical foun- 
dations of causality theory in the database context, which 
we view as a unified framework that deals with query result 
explanations. It also brings forth many interesting problems 
that can be explored in future work. 

This paper focused on deterministic cases; we plan to ex- 
tend our framework to probabilistic data in the future. The 
fact that functional causes are based on the use of potential 
functions makes this extension straightforward: the set of 
Boolean variables X for tuples in the deterministic case be- 
comes a set of probabilities x. Note that this would not be 
possible if the causality definition had used just the Boolean 
formulas. Potential functions also have the additional ad- 
vantage that they can be analytically manipulated as op- 
posed to Boolean functions. We currently investigate the 
properties of their derivatives with the intuition that they 
reveal another facet of causality, particularly with regard to 
aggregates and probabilities. 

Acknowledgements. We would like to thank Christoph Koch
for valuable insights, and Chris Re for helpful discussions.

APPENDIX

A. NOMENCLATURE 



Symbol    Description

$N$    Set of Boolean random variables
$X$    Set of input variables
$Y$    Set of dependent variables: $Y = N \setminus X$
$M = (N, \mathcal{F})$    Boolean causal model with nodes $N = X \cup Y$ and functional equations $\mathcal{F} = \{F_N \mid N \in N\}$. $F_X$ for an input variable $X$ is its actual assignment $x^0$.
$x^0, y^0$    Actual truth assignment of input variables, and resulting truth assignment for dependent variables: $X(x^0) = x^0$, $Y(x^0) = y^0$
$[S \leftarrow s^1]$    External intervention replacing the structural equation $F_N$ for each node $N$ in $S$ with a truth assignment $n^1$
$CN$    Causal Network of a causal model
$\Phi(X)$    Boolean formula for the effect in the CN. Corresponds to $Y_j(X)$ for chosen effect variable $Y_j$
$\varphi$    Effect under consideration. Event of $\Phi(X) = \Phi(x^0)$, i.e. the effect variable having its actual assignment $\varphi = (Y_j = y_j^0)$
DE, DN    Dissociation Expression, Dissociation Network
$t : N_1 \to N_2$    Transformation from network $N_1$ to network $N_2$. $t : CN \to DN$ represents the transformation from a causal to a dissociation network.
$[V]_t$    Mapping of a set of nodes $V$ from network $N_1$ to $N_2$ under transformation $t : N_1 \to N_2$
$X_t, Y_t$    Sets of Boolean variables in DN: $X_t = [X]_t$ and $Y_t = [Y]_t$ for $t : CN \to DN$
$x, y$    Sets of functional variables in CN
$x_t, y_t$    Sets of functional variables in DN
$P_\Phi(x), P_\Phi(x_t)$    Potential functions in CN and DN, respectively
$\Delta P_\Phi(X_i)$    Change in potential function by inverting input $X_i$: $\Delta P_\Phi(X_i) = P_\Phi(x^0) - P_\Phi(1 - x_i^0,\, x^0 \setminus \{x_i^0\})$
$x^0 \oplus S$    The assignment obtained by starting from $x^0$ and inverting all variables in $S$: $\{1 - x_i^0 \mid X_i \in S\} \cup \{x_j^0 \mid X_j \notin S\}$
$\Delta P_\Phi(S)$    Change in potential function by inverting all variables in $S$: $\Delta P_\Phi(S) = P_\Phi(x^0) - P_\Phi(x^0 \oplus S)$
$S$    Subset of $X$ chosen for condition FC2(b)
$[S]_t$    Set $S_t \subseteq X_t$ that corresponds to $S \subseteq X$: $S_t = \{X_{t,j} \mid X_j \in S\}$



B. DETAILS SECTION 2 

B.1 Details on AC and FC for Shock C

In Sect. 2.2, we showed that the HP definition of actual
causes incorrectly models A = 1 to be a cause of C = 1 in the
Shock C example. We also mentioned but did not show that
functional causes can model this example correctly. Here,
we give the details on this issue. In particular, we show
that functional causes can model common sense causality
correctly (i.e. B's decision to be mean is a cause for C being
shocked) with the help of appropriate policy variables, while
actual causes cannot, even with the help of more complex
network structure.² We then give an intuitive explanation
of why and where the HP definition of actual causes fails.

Note that the Shock C example [21] is an important
example from the philosophical literature that illustrates that
causality is not transitive, in general.³ When philosophers
[15, 22] and HP [13] argue for intransitivity of causality,
they use examples similar to this one as arguments. Out of
the many examples, Shock C is the most compelling case,
and the HP definition does not model it correctly. When
HP argue for intransitivity of causality, they first have to
tweak this example into some modification [13, Example
4.3] where their definition happens to work correctly. In
contrast, the definition of functional causes does give the
correct attribution of causes given the appropriate network.
Also note that the Shock C example is structurally equivalent
to the king-assassin-bodyguard example [22, Sec. 4.3],
another counterexample to the HP definition.

² Halpern and Pearl stress the importance of careful causal
modeling [13, Sec. 6], implying that actual causes can handle
cases correctly given the appropriately modeled network.

Example B.1 (Shock C - Model 1). In Example 2.6,
we used the causal model in Fig. 12, i.e. structural equations

$B = A$

$C = (A = B) = AB \vee \bar{A}\bar{B}$

under actual assignment A = 1, and hence B = 1, C = 1. In
our notation, the set of input variables is X = {A} and the
set of dependent variables Y = {B, C}. The effect $\varphi$ under
consideration is C = 1.

[Figure 12 network: input A = 1; node B = A; node C = (A = B)]

Figure 12: Model 1: Simple causal model for the
Shock C example. Both A = 1 and B = 1 are actual
causes, neither of them is a functional cause for C = 1.

AC: Here, and contrary to common sense, both A = 1
and B = 1 are actual causes of C = 1: (i) A = 1 is an
actual cause for W = {B} with b = 1. Then AC2(a) holds:
[A^Q,B<-1] -><(>. Also AC2(b) holds: Z\N C \{C} is 
empty, and C is 1 for either W' = W or W' = 0, because of 
$A \leftarrow 1$. (ii) B = 1 is an actual cause for $W = \emptyset$.

FC: There is no functional cause of C = 1 as its formula
is a tautology. B is a dependent variable and hence B = 1 is
not a permissible cause. A is the only input variable. Since
it is not a counterfactual cause and there is no other input
variable to invert for S, it is not a functional cause either.

The appropriate intuition is that C = 1 holds no matter
what the assignment of the leaf nodes is. Hence there is
no cause. If we want to model B's decision to be mean as
a possible cause, we need to model his "intention" with an
appropriate policy variable as shown next.



3 Several philosophers have taken issue with the idea of causality
being intransitive (e.g. [19]) as it seems counterintuitive
at first sight. This resonates with Pearl [27, p. 237]
asking "why transitivity is so often conceived of as an inherent
property of causal dependence". He continues: "One
plausible answer is that we normally interpret transitivity
to mean the following: If (1) X causes Y and (2) Y causes Z
regardless of X, then (3) X causes Z." In Sect. 3.3 we have
formalized this observation and given a concrete Markovian
criterion as a sufficient criterion for functional causality to
be transitive.



Example B.2 (Shock C - Model 2). We use the more
elaborate causal model from Fig. 13 with structural equations

$B = (M \equiv A) = MA \lor \bar{M}\bar{A}$

$C = (A \equiv B) = AB \lor \bar{A}\bar{B}$

under actual assignment $\vec{x}^0 = \{A = 1, M = 1\}$, and hence
$\vec{y}^0 = \{B = 1, C = 1\}$. The intuition is that player B now has
the option to be either mean ($M = 1$) and follow the decision
of A with $MA$, or not to be mean ($\bar{M}$ or $M = 0$) and do the
opposite of A, i.e. $\bar{M}\bar{A}$. The motivation is to introduce
a new leaf policy variable whose actual assignment $M = 1$
models a permissible modified cause, i.e. B's decision to be
mean as a leaf node.

[Figure: causal network with edges M → B, A → B, A → C and B → C; node labels $M = 1$, $A = 1$, $B = MA \lor \bar{M}\bar{A}$, $C = (A \equiv B)$]

Figure 13: Model 2: Causal model with explicit policy
variable M modeling B's decision to be mean. All
three $A = 1$, $M = 1$, and $B = 1$ are actual causes for C
being shocked according to the HP definition. Only
B's decision to be mean ($M = 1$) is a functional cause
according to our definition, which arguably better
represents the common sense interpretation.

AC: Here again $A = 1$ is an actual cause of $C = 1$. In
addition, $M = 1$ and $B = 1$ are causes. Details are the same
as in Example B.1.

FC: $M = 1$ is a functional cause with responsibility 1 (i.e.
a counterfactual cause) of $C = 1$, but $A = 1$ is not. (i) $A = 1$
is not a functional cause: First try $S = \emptyset$ (equivalent to
counterfactual cause). $\Delta P_\Psi(A) = 0$, hence it fails FC2(a).
Second try $S = \{M\}$. Then FC2(a) holds: $\Delta P_\Psi(A, M) \neq 0$.
However, FC2(b) fails for $S' = S = \{M\}$: $\Delta P_\Psi(M) \neq 0$.
Hence, $A = 1$ is not a functional cause. (ii) $M = 1$ is a
functional cause with responsibility 1, i.e. a counterfactual
cause. Inverting $M$ when $S = \emptyset$ inverts $C$.
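
The two Shock C models are small enough to check by enumeration. The following Python sketch evaluates the structural equations of Examples B.1 and B.2 for every input assignment. The function names `model1` and `model2` are illustrative and not part of the paper, and the sketch only tests how C depends on the inputs; it does not implement the full FC definition with dissociation expressions.

```python
from itertools import product


def model1(a: int) -> int:
    """Model 1: B = A, C = (A == B)."""
    b = a
    return int(a == b)


def model2(a: int, m: int) -> int:
    """Model 2: B = (M == A), C = (A == B)."""
    b = int(m == a)
    return int(a == b)


# Model 1: C is 1 for both values of the only input A (the formula is a tautology).
model1_outputs = [model1(a) for a in (0, 1)]

# Model 2: tabulate C for every assignment of the inputs A and M.
model2_table = {(a, m): model2(a, m) for a, m in product((0, 1), repeat=2)}

# In Model 2 the output always equals M, so inverting M inverts C,
# while inverting A alone never changes C.
c_follows_m = all(c == m for (_, m), c in model2_table.items())
```

Under these assumptions, no inversion of A can change C in Model 1, and in Model 2 only the policy variable M is counterfactual for C, in line with the FC analysis above.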

Intuition for the AC failure and FC success. The
reasons for the HP definition to give undesired results seem
to be twofold: (1) The HP definition allows $W$ to be chosen
from any node in the causal network, i.e. including nodes
in the causal path from the alleged cause $X_i$ to the effect
variable $Y_j$; and (2) it allows giving the actual assignment
$w_i' = w_i^0$ to nodes in $W$, i.e. without inverting them. In
contrast, our definition of functional causes makes the
following changes: (1) We only consider leaf nodes as possible
contingencies (i.e. to include in $S$). (2) Since we only
consider input nodes, we have some implicit minimality
criterion for $S$. If a variable $X_k$ does not have to be inverted
($x_k' = x_k^0$) to make another variable a cause, it does not have
to be included in $S$. (3) We only consider input variables
(i.e. leaf nodes) as permissible causes. This has intuitive,
practical, and also philosophical appeal: an intermediate
dependent variable should not be credited with being a cause;
it is rather some decision to follow some structural dependency
(i.e. some policy) rather than another that makes an
intermediate node a "visible" cause. As illustrated with
Example B.2, we can always introduce new policy variables to
a network to analyze the causal effects of structural equations
of intermediate nodes 4. We used this idea in Sect. 4.3
to showcase the why-not approach from Chapman and Jagadish
[4] in Example 4.9. Here, we again use this idea to
explicitly model B's decision to be mean as an independent
input variable in $\mathbf{X}$, and hence a possible functional cause.

B.2 The CHK definition for Boolean Circuits 

Chockler, Halpern and Kupferman (CHK from now on)
give a reformulated definition of actual cause for Boolean
circuits (their Def. 2.4) and argue that binary acyclic causal
models are equivalent to Boolean circuits, i.e. Boolean causal
networks where intermediate nodes represent the Boolean
operations $\land$, $\lor$, or $\neg$, and negations occur only at the level
above the input nodes. As we will show with a simple example,
this CHK definition of causality for a Boolean circuit
is not equivalent to the original HP definition 5.

Example B.3 (Loader [16]). For a firing squad consisting
of shooters B and C, it is A's job to load B's gun.
In an instance of this problem shown in Fig. 14, A loads B's
gun ($A = 1$), B does not shoot ($B = 0$), but C shoots ($C = 1$),
and the prisoner dies ($Y = 1$).

[Figure: Boolean circuit with gates $Y_1 = A \land B$ and $Y = Y_1 \lor C$]

Figure 14: A loads B's gun, but B does not shoot.
C shoots and the prisoner Y dies.

AC: The HP definition (as well as our FC definition)
concludes that A is not a cause. This is in accordance with
our intuition that A cannot be a cause of the prisoner dying
if the gun A loads was not fired.

CHK: The causal network of Fig. 14 corresponds to a
Boolean circuit with only AND/OR gates. According to
CHK's Def. 2.4, A is a cause of gate Y if there is an assignment
that makes A "critical", i.e. counterfactual in our notation.
This assignment exists and is ($b' = 1$, $c' = 0$). Hence,
according to the CHK definition, $A = 1$ is a cause for $Y = 1$.

Analysis of the CHK definition. The decisive differences
that the CHK definition makes over the original HP
definition are twofold: (1) The CHK definition does not
inspect the causal path, i.e. the possible changes that a new
assignment inflicts on the other gates (e.g. here on $Y_1$). (2)
The CHK definition does not check inverting all subsets of $S$.
For example, in the loader example, the prisoner would not
have died for the subset $S' = \{c' = 0\} \subset S$, which indicates
that $A = 1$ should not be a cause.
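
The difference between the two checks can be replayed on the loader circuit. The Python sketch below enumerates the sets of other inputs whose inversion makes a variable "critical" (the CHK-style test) and then checks whether every subset of such a set leaves the effect unchanged (the subset test discussed above). All helper names are hypothetical, and the subset test is applied directly to the formula, which is an assumption that sidesteps dissociation expressions; each input appears only once in this circuit.

```python
from itertools import combinations

X0 = {"A": 1, "B": 0, "C": 1}  # actual assignment of the loader example


def effect(x):
    """Y = (A AND B) OR C."""
    return (x["A"] & x["B"]) | x["C"]


def invert(x, variables):
    """Return a copy of x with the given variables flipped."""
    return {k: 1 - v if k in variables else v for k, v in x.items()}


def critical_contingencies(var):
    """All sets S of other inputs whose inversion makes `var` counterfactual."""
    others = [v for v in X0 if v != var]
    found = []
    for r in range(len(others) + 1):
        for subset in combinations(others, r):
            base = invert(X0, subset)
            if effect(base) != effect(invert(base, {var})):
                found.append(set(subset))
    return found


def subsets_keep_effect(contingency):
    """True if inverting any subset of the contingency alone leaves Y unchanged."""
    members = sorted(contingency)
    return all(
        effect(invert(X0, subset)) == effect(X0)
        for r in range(len(members) + 1)
        for subset in combinations(members, r)
    )
```

For A, the only critical contingency is {B, C}, but inverting C alone already changes Y, so the subset test rejects A. C is counterfactual with the empty contingency.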

While our definition of functional cause also focuses on
the input variables only, we made two crucial modifications
that avoid new problems such as the loader example, and
remedy existing problems of the original HP definition with
cases such as the Shock C example: (1) We use dissociation
expressions that allow us to manipulate subsets of dissociated
input variables while testing causality, and hence,
manipulate the relevant causal path only. (2) We test for
all subsets of the support $S$, and hence, verify the causal
relevance of the input variable under consideration.

4 This idea of pushing causes to the input nodes seems to
be implicit in Pearl [26]. Pearl states that "any external
intervention (on a structural function) can be represented
graphically as an added parent node".

5 The CHK paper (p. 20:6) implies equality to the Boolean
formulation of Eiter and Lukasiewicz [9], which is not true.

B.3 Expression Folding 



Proof [Theorem 2.10] (DE Minimality). For all expressions
in $\mathcal{D}$, there exists a folding to $\Phi$. This means that
every $\Psi \in \mathcal{D}$ is syntactically equivalent with $\Phi$, but may
have one or more instances of variables replaced with new
variables. If $\exists \Psi \in \mathcal{D}$ such that $|\Psi| = |\Phi|$, then there is a 1
to 1 correspondence of variables from $\Psi$ to $\Phi$ and therefore
$\Psi = \Phi$. Assume $\Psi \in \mathcal{D}$ of minimum size. Obviously, if
$|\Psi| = |\Phi|$, then $\forall \Psi' \in \mathcal{D}$ with $|\Psi'| = |\Psi|$, $\Psi' = \Psi$.

We now look at the case where $|\Psi| > |\Phi|$. Assume $\Psi, \Psi' \in \mathcal{D}$
of minimum size, so $|\Psi| = |\Psi'|$. For $\Phi = \sigma(\Phi_1, \Phi_2, \ldots, \Phi_k)$,
$\exists \mathcal{F}$ such that $\Phi_i = \mathcal{F}(\Psi_i)$, where $\Psi = \sigma(\Psi_1, \Psi_2, \ldots, \Psi_k)$, and
$\exists \mathcal{F}'$ such that $\Phi_i = \mathcal{F}'(\Psi'_i)$, where $\Psi' = \sigma(\Psi'_1, \Psi'_2, \ldots, \Psi'_k)$.
So $\mathcal{F}(\Psi_i) = \mathcal{F}'(\Psi'_i)$. From the definition of folding, this holds
for all subexpressions $\Phi_i$ of $\Phi$.

If $|\Psi_s| = |\Psi'_s|$ for all subexpressions $\Phi_s$ of $\Phi$, then $\Psi' = \Psi$.
Assume $\Phi_s$ some subexpression of $\Phi$ (or $\Phi$ itself), such
that $\exists i, j$ such that $|\Psi_{s,i}| < |\Psi'_{s,i}|$ and $|\Psi_{s,j}| > |\Psi'_{s,j}|$, while
$|\Psi_s| > |\Psi'_s|$. Such $\Phi_s$ has to exist because $|\Psi'| = |\Psi|$. That
means that $\Psi_s \neq \Psi'_s$.

Construct $\Psi^* = \sigma(\Psi_{s,1}, \Psi_{s,2}, \ldots, \Psi'_{s,j}, \ldots, \Psi_{s,k})$. So $\Psi^*$
is the same as $\Psi_s$, apart from subexpression $\Psi_{s,j}$ which is
replaced with $\Psi'_{s,j}$. Then $\exists$ a folding from $\Psi^*$ to $\Phi_s$ and $|\Psi^*| < |\Psi_s|$.
This means that the DE $\Psi^{**}$ that results from replacing
$\Psi_s$ with $\Psi^*$ in $\Psi$ is also a DE for $\Phi$ and $|\Psi^{**}| < |\Psi|$, which
is a contradiction. Therefore $\Psi = \Psi'$. □



C. DETAILS SECTION 3 

C.1 Functional vs Actual Causes



Proof [Theorem 3.1] (CC-FC-AC Relationship). If
$X_i = x_i^0$ is a counterfactual cause, then it has functional
responsibility $\rho = 1$ (for $S = \emptyset$, $\Delta P_\Psi(X_i) \neq 0$), and therefore
is a FC.

We will show that every FC is an AC. Obviously, condition
FC1 implies AC1. We need to show that AC2 holds.

If $X_i = x_i^0$ is a functional cause of effect $\varphi$ defined by
Boolean formula $\Phi(\mathbf{X})$, then $\exists S \subseteq \mathbf{X} \setminus \{X_i\}$ s.t. $\Delta P_\Psi(X_i, S) \neq 0$
and $\forall S'_t \subseteq [S]_t$: $\Delta P_\Psi(S'_t) = 0$.

We pick $Z$ to be the causal path, and $W$ the rest of the
nodes. Assume $W_x = W \cap \mathbf{X}$, $W_y = W \setminus W_x$. We pick
assignment $w'_x$ as follows: $w'_{x_j} = \neg x_j^0$ if $X_j \in S$ and $w'_{x_j} = x_j^0$
otherwise. All nodes in $W_y$ are descendants of $W_x$, so we
assign $w'_y$ as the inferred values from assignment $w'_x$. $w' = w'_x \cup w'_y$.
From $\Delta P_\Psi(X_i, S) \neq 0$ it follows that
$\Phi(X_i \leftarrow \neg x_i^0, W \leftarrow w') = \neg\varphi$, so AC2(a) is satisfied.

Assume some $W' \subseteq W$ and some $Z' \subseteq Z$, where $Z$ is the
causal path of $X_i$ in CN.

Set $S_{w'} = \{X_j : X_j \in W' \text{ and } w'_j = \neg x_j^0\}$. Obviously,
$S_{w'} \subseteq S$. Also, set $S_{z'} = [\mathbf{X}]_t \cap ANC(Z')$, where $ANC(Z')$ is
the group of all ancestors to any node in $Z'$. Finally, set
$S'_t = S_{w'} \setminus S_{z'}$, so $S'_t \subseteq [S]_t$. Setting $S_{z'}$ to the original values
ensures that all nodes in $Z'$ are set to their original values $z^*$.
Therefore, $P_\Psi(\neg \vec{s}'^{\,0}_t, \vec{x}^0 \setminus \vec{s}'_t) = \Phi(x_i^0, W' \leftarrow w', Z' \leftarrow z^*)$.
Since $\Delta P_\Psi(S'_t) = 0$, $\Phi(x_i^0, W' \leftarrow w', Z' \leftarrow z^*) = \varphi$, and
condition AC2(b) is satisfied.

Condition AC3 is obvious, as $X_i$ is a single literal, and
therefore $X_i = x_i^0$ is an actual cause. □

C.2 Formula Expansion 

In this section we give formal definitions of formula ex- 
pansion. 

Definition C.1 (Node Expansion). Node expansion
of a network CN with formula $\Phi$ to a network with formula
$\Phi^e$ is the addition of a node $V'$ along an edge $(V, U)$ of the
causal network CN, such that $\Phi^e \equiv \Phi$, and none of the
formulas of the dependent nodes change.

Definition C.2 (Edge Expansion). Edge expansion
of a network CN with formula $\Phi$ to a network with formula
$\Phi^e$ is the addition of an edge $(V, V')$ in CN, such that $\Phi^e \equiv \Phi$,
and none of the formulas of the dependent nodes apart
from $V'$ change.

Definition C.3 (Single Step Expansion). A network
$CN^e$ with formula $\Phi^e$ is a single-step expansion of network
CN with formula $\Phi$ if it is either a node or edge expansion
of CN.

Definition C.4 (Expansion). A network $CN^e$ with formula
$\Phi^e$ is an expansion of network CN with formula $\Phi$ iff
there exists an ordered set of networks $\{CN_1, CN_2, \ldots, CN_k\}$
with $CN_1 = CN$ and $CN_k = CN^e$, such that $CN_{i+1}$ is a
single step expansion of $CN_i$, $\forall i \in [1, k-1]$.

Lemma C.5. If $X_i = x_i^0$ is a cause of effect $\varphi^e$ in $CN^e$
with formula $\Phi^e$, which is a single step expansion of formula
$\Phi$, then $X_i = x_i^0$ is also a cause of effect $\varphi$ in formula $\Phi$.

Proof. Assume $\Psi$ the DE of $\Phi$ and $\Psi^e$ the DE of $\Phi^e$. For
simplicity we say $X_i$ is a cause, meaning $X_i = x_i^0$ is a cause.
$t$ and $T$ represent the dissociation network transformations
of CN and $CN^e$ respectively, with respect to $X_i$: $t: CN \to DN$,
$T: CN^e \to DN^e$. So, $[\mathbf{X}]_t$ are the input nodes of DN,
and $[\mathbf{X}]_T$ the input nodes of $DN^e$. We use the term potent
to refer to variables in the causal network that map to more
than one variable in the dissociation network.

For ease of presentation, we also write $P_\Psi(\neg \vec{s}^0)$ to
denote $P_\Psi(\neg \vec{s}^0, \vec{x}^0 \setminus \vec{s})$; in other words, all the variables
appearing in the argument list of $P_\Psi$ are set to the denoted
values, and the ones not appearing to their original values
given by $\vec{x}^0$.

Since $X_i$ is a cause of $\varphi^e$, $\exists S$ such that $\Delta P_{\Phi^e}(S, X_i) \neq 0$
and $\Delta P_{\Psi^e}(S'_T) = 0$, $\forall S'_T \subseteq [S]_T$.

By definition of expansion, $\Phi^e(\mathbf{X}) = \Phi(\mathbf{X})$, and therefore
$P_\Phi = P_{\Phi^e}$, which means that $\Delta P_\Phi(S, X_i) \neq 0$ for the same
set $S$.

If $\Phi^e$ is a node expansion of $\Phi$, then $P_\Psi = P_{\Psi^e}$, and
therefore $X_i$ is a cause of $\varphi$.

If $\Phi^e$ is an edge expansion of $\Phi$, by the addition of an edge
$(v, u)$, then node $v$ may become potent with respect to $X_i$
in $\Phi^e$. If $v$ is not potent, then $[\mathbf{X}]_t = [\mathbf{X}]_T$, and therefore
$P_{\Psi^e} = P_\Psi$, which means that $X_i$ is also a cause of $\varphi$.

If $v$ is potent with respect to $X_i$, then $DN^e$ contains a
set of replicated nodes $V'$ of $V$ (node $v$ and its ancestors),
which are not contained in DN, so $[\mathbf{X}]_t \subset [\mathbf{X}]_T$. Denote
as $X_v$ the subset of $\mathbf{X}$ that are ancestors of $v$, and $X_{v'}$ the
subset of $\mathbf{X}$ that are ancestors of the replica $v'$ in $DN^e$.
Then $X_T = X_t \cup X_{v'}$.

If $X_v \cap [S]_T = \emptyset$, then $\{X_v \cup X_{v'}\} \cap S_T = \emptyset$. Then,
$S_t = S_T$, and for any $S'_t \subseteq S_t$, $S'_t$ is also a subset of $S_T$.
That means that $P_\Psi(\neg \vec{s}'^{\,0}_t) = P_{\Psi^e}(\neg \vec{s}'^{\,0}_t) = P_{\Psi^e}(\vec{x}^0)$, which
means that $\Delta P_\Psi(S'_t) = 0$.

If $X_v \cap S_T \neq \emptyset$, then $X_{v'} \cap S_T \neq \emptyset$. Then $S_t \subset S_T$, as
$X_v \cap S_t \neq \emptyset$, but $X_{v'} \cap S_t = \emptyset$. For any $S'_t \subseteq S_t$, let $X'_v = S'_t \cap X_v$,
and $X'_{v'}$ its replicated equivalent in $DN^e$. $\exists S'_T = S'_t \cup X'_{v'}$,
and $S'_T \subseteq S_T$. Also, by definition of expansion,

$P_\Psi(\neg \vec{x}'^{\,0}_v, \neg \vec{s}'^{\,0}_t \setminus \vec{x}'_v) = P_{\Psi^e}(\neg \vec{x}'^{\,0}_v, \neg \vec{x}'^{\,0}_{v'}, \neg \vec{s}'^{\,0}_t \setminus \vec{x}'_v) = P_{\Psi^e}(\vec{x}^0)$

Therefore, $\Delta P_\Psi(S'_t) = 0$, $\forall S'_t \subseteq S_t$ in all cases of expansion,
which means that $X_i$ is a cause of $\varphi$. □

Lemma C.6. If $X_i = x_i^0$ is a cause of effect $\varphi$ in CN with
formula $\Phi$, and $\Phi^e$ is a single step expansion of $\Phi$ that does
not contain negated variables, then $X_i = x_i^0$ is also a cause
of effect $\varphi^e$ in formula $\Phi^e$.

Proof. For the most part, this proof is similar to the
proof of Lemma C.5.

Assume $\Psi$ the DE of $\Phi$ and $\Psi^e$ the DE of $\Phi^e$. For simplicity
we say $X_i$ is a cause, meaning $X_i = x_i^0$ is a cause. $t$ and
$T$ represent the dissociation network transformations of CN
and $CN^e$ respectively, with respect to $X_i$: $t: CN \to DN$,
$T: CN^e \to DN^e$. So, $[\mathbf{X}]_t$ are the input nodes of DN, and
$[\mathbf{X}]_T$ the input nodes of $DN^e$. We use the term potent to
refer to variables in the causal network that map to more
than one variable in the dissociation network.

For ease of presentation, we also write $P_\Psi(\neg \vec{s}^0)$ to
denote $P_\Psi(\neg \vec{s}^0, \vec{x}^0 \setminus \vec{s})$; in other words, all the variables
appearing in the argument list of $P_\Psi$ are set to the denoted
values, and the ones not appearing to their original values
given by $\vec{x}^0$.

Since $X_i$ is a cause of $\varphi$, $\exists S$ such that $\Delta P_\Phi(S, X_i) \neq 0$ and
$\Delta P_\Psi(S'_t) = 0$, $\forall S'_t \subseteq [S]_t$.

By definition of expansion, $\Phi^e(\mathbf{X}) = \Phi(\mathbf{X})$, and therefore
$P_\Phi = P_{\Phi^e}$, which means that $\Delta P_{\Phi^e}(S, X_i) \neq 0$ for the same
set $S$.

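If $\Phi^e$ is a node expansion of $\Phi$, then $P_\Phi = P_{\Phi^e}$ and $P_\Psi = P_{\Psi^e}$,
and therefore $X_i$ is a cause of $\varphi^e$.

If $\Phi^e$ is an edge expansion of $\Phi$, by the addition of an
edge $(v, u)$, node $v$ may become potent with respect to $X_i$
in $\Phi^e$. If $v$ is not potent, then $[\mathbf{X}]_t = [\mathbf{X}]_T$, and therefore
$P_{\Psi^e} = P_\Psi$, which means that $X_i$ is also a cause of $\varphi^e$.

If $v$ is potent with respect to $X_i$, then $DN^e$ contains a
set of replicated nodes $V'$ of $V$ (node $v$ and its ancestors),
which are not contained in DN, so $[\mathbf{X}]_t \subset [\mathbf{X}]_T$. Denote
as $X_v$ the subset of $\mathbf{X}$ that are ancestors of $v$, and $X_{v'}$ the
subset of $\mathbf{X}$ that are ancestors of the replica $v'$ in $DN^e$.
Then $X_T = X_t \cup X_{v'}$.

If $X_v \cap [S]_t = \emptyset$, then $\{X_v \cup X_{v'}\} \cap [S]_T = \emptyset$. Then,
$S_t = S_T$, and for any $S'_T \subseteq S_T$, $S'_T$ is also a subset of $S_t$.
That means that $P_{\Psi^e}(\neg \vec{s}'^{\,0}_T) = P_\Psi(\neg \vec{s}'^{\,0}_T) = \Phi^e(\vec{x}^0)$, which
means that $\Delta P_{\Psi^e}(S'_T) = 0$.

If $X_v \cap S_t \neq \emptyset$, then $X_v \cap S_T \neq \emptyset$ and $X_{v'} \cap S_T \neq \emptyset$. For
any $S'_T \subseteq S_T$, denote $X'_v = X_v \cap S'_T$ and $X'_{v'} = X_{v'} \cap S'_T$.
Assume $X_c \subseteq X'_v$ is the set of all variables in $X'_v$ that have
an equivalent in $X'_{v'}$, i.e. their replicas are in $X'_{v'}$. Then
$X'_v$ and $X'_{v'}$ can be rewritten as follows: $X'_v = X_1 \cup X_c$ and
$X'_{v'} = X'_2 \cup X'_c$. The replicas of $X_1$, $X_2$, and $X_c$ in $DN^e$ are
$X'_1$, $X'_2$ and $X'_c$ respectively, so $X'_1 \not\subseteq X'_{v'}$ and $X_2 \not\subseteq X'_v$.
Also, denote $S''_T = S'_T \setminus \{X'_v \cup X'_{v'}\}$.

$\Delta P_{\Psi^e}(S'_T) = P_{\Psi^e}(\vec{x}^0) - P_{\Psi^e}(\neg \vec{x}_1^0, \neg \vec{x}_c^0, \neg \vec{x}'^{\,0}_2, \neg \vec{x}'^{\,0}_c, \neg \vec{s}''^{\,0}_T)$

Also, we know that $\Delta P_\Psi(S'_t) = 0$, $\forall S'_t \subseteq S_t$. Assign $S'_t = X_1 \cup X_2 \cup X_c \cup S''_t$
and $S''_t = S'_t \setminus \{X_1 \cup X_2 \cup X_c\}$. Then
$S'_t \subseteq S_t$, and therefore:

$P_\Psi(\neg \vec{x}_1^0, \neg \vec{x}_2^0, \neg \vec{x}_c^0, \neg \vec{s}''^{\,0}_t) = P_\Psi(\vec{x}^0)$

By definition of expansion:

$P_{\Psi^e}(\neg \vec{x}_1^0, \neg \vec{x}_2^0, \neg \vec{x}_c^0, \neg \vec{x}'^{\,0}_1, \neg \vec{x}'^{\,0}_2, \neg \vec{x}'^{\,0}_c, \neg \vec{s}''^{\,0}_t) = P_\Psi(\neg \vec{x}_1^0, \neg \vec{x}_2^0, \neg \vec{x}_c^0, \neg \vec{s}''^{\,0}_t)$

We compute $P_{\Psi^e}(\neg \vec{x}_1^0, \neg \vec{x}_c^0, \neg \vec{x}'^{\,0}_2, \neg \vec{x}'^{\,0}_c, \neg \vec{s}''^{\,0}_t)$, re-written
as: $P_{\Psi^e}(\neg \vec{x}_1^0, \vec{x}_2^0, \neg \vec{x}_c^0, \vec{x}'^{\,0}_1, \neg \vec{x}'^{\,0}_2, \neg \vec{x}'^{\,0}_c, \neg \vec{s}''^{\,0}_t)$.

Since variables are not negated, $P_\Psi$ and $P_{\Psi^e}$ are monotone.
Therefore:

$P_\Psi(\vec{x}^0) = P_{\Psi^e}(\neg \vec{x}_1^0, \neg \vec{x}_2^0, \neg \vec{x}_c^0, \neg \vec{x}'^{\,0}_1, \neg \vec{x}'^{\,0}_2, \neg \vec{x}'^{\,0}_c, \neg \vec{s}''^{\,0}_t)$
$\leq P_{\Psi^e}(\neg \vec{x}_1^0, \vec{x}_2^0, \neg \vec{x}_c^0, \vec{x}'^{\,0}_1, \neg \vec{x}'^{\,0}_2, \neg \vec{x}'^{\,0}_c, \neg \vec{s}''^{\,0}_t)$
$\leq P_{\Psi^e}(\vec{x}^0) = P_\Psi(\vec{x}^0)$

Therefore, $P_{\Psi^e}(\neg \vec{x}_1^0, \neg \vec{x}_c^0, \neg \vec{x}'^{\,0}_2, \neg \vec{x}'^{\,0}_c, \neg \vec{s}''^{\,0}_t) = P_\Psi(\vec{x}^0)$, which
means that $\Delta P_{\Psi^e}(S'_T) = 0$, and $X_i$ is a cause of $\varphi^e$. □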

These two lemmas lead to the general theorems of formula
expansion presented in Sect. 3.2.



Proof [Theorem 3.3] (Formula Expansion). Since $\Phi^e$
is an expansion of $\Phi$, $\exists$ an ordered set of formulas $\{\Phi_1, \Phi_2, \ldots, \Phi_k\}$
with $\Phi_1 = \Phi$ and $\Phi_k = \Phi^e$, such that $\Phi_{i+1}$ is a single step
expansion of $\Phi_i$, $\forall i \in [1, k-1]$.

As shown in Lemma C.5, if $X_i = x_i^0$ is a cause of $\varphi_i$ then
it is also a cause of $\varphi_{i-1}$, $\forall i \in [2, k]$. Therefore, if $X_i = x_i^0$ is a
cause of $\varphi_k$, it is also a cause of $\varphi_1$. □



Proof [Theorem 3.4] (Exp. of Positive Formulas).
As shown by Theorem 3.3, if something is a cause of $\varphi^e$, then
it is also a cause of $\varphi$. We now need to show that if $X_i = x_i^0$
is a cause of $\varphi$, then it is also a cause of $\varphi^e$.

Since $\Phi^e$ is an expansion of $\Phi$, $\exists$ an ordered set of formulas
$\{\Phi_1, \Phi_2, \ldots, \Phi_k\}$ with $\Phi_1 = \Phi$ and $\Phi_k = \Phi^e$, such that $\Phi_{i+1}$
is a single step expansion of $\Phi_i$, $\forall i \in [1, k-1]$.

As shown in Lemma C.6, if $X_i = x_i^0$ is a cause of $\varphi_{i-1}$
then it is also a cause of $\varphi_i$, $\forall i \in [2, k]$. Therefore, if $X_i = x_i^0$ is
a cause of $\varphi_1$, it is also a cause of $\varphi_k$. □

C.3 Markovian transitivity 



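Proof [Prop. 3.7] (Markovian transitivity). Here we
denote as $I_{Y_i}$ the potential function of formula $\Phi_{Y_i}$, and
$P_{Y_i}$ the potential function of the DE of $\Phi_{Y_i}$. To simplify
notation, we omit the non-negated terms in the potential
functions. So we write $P(\neg \vec{s}^0)$ meaning $P(\neg \vec{s}^0, \vec{x}^0 \setminus \vec{s})$.

Assume $X$ is a functional cause of $Y_1$ with responsibility
$\rho_1$. Then there exists $S_{Y_1} \subseteq \mathbf{X} \cap ANC(Y_1)$, such that
$\Delta I_{Y_1}(X, S_{Y_1}) \neq 0$, and $\forall S'_t \subseteq [S_{Y_1}]_t$, $\Delta P_{Y_1}(S'_t) = 0$, and
$\rho_1 = \frac{1}{|S_{Y_1}| + 1}$.

Also, in the mutilated network, $Y_1$ is a cause of $Y_2$
with responsibility $\rho_2$; then $\exists$ a minimum set $S_{Y_2} \subseteq \mathbf{X} \cap \{ANC(Y_2) \setminus ANC(Y_1)\}$,
such that $\Delta I_{Y_2}(X, S_{Y_2}) \neq 0$, and
$\forall S'_t \subseteq [S_{Y_2}]_t$, $\Delta P_{Y_2}(S'_t) = 0$, where $I_{Y_2}$ and $P_{Y_2}$ are the potential
functions of $Y_2$ in the mutilated CN and corresponding
DN. Also, $\rho_2 = \frac{1}{|S_{Y_2}| + 1}$.

Obviously, $S_{Y_1} \cap S_{Y_2} = \emptyset$ and $[S_{Y_1}]_t \cap [S_{Y_2}]_t = \emptyset$. Since $Y_1$ is
Markovian, no ancestors of $Y_1$ connect to the rest of the network
without going through $Y_1$. Therefore, nodes in $[S_{Y_1}]_t$ also do not
connect to the rest of the network without going through $Y_1$.

Assume $\Phi$ is the Boolean formula at $Y_2$ on the complete
network. Also, denote $X_{Y_1} = \{\mathbf{X} \setminus X\} \cap ANC(Y_1)$ and
$X_{Y_2} = \{\{\mathbf{X} \setminus X\} \cap ANC(Y_2)\} \setminus X_{Y_1}$. Then
$I_\Phi(\vec{x}) = I_\Phi(x, \vec{x}_{Y_1}, \vec{x}_{Y_2}) = I_{Y_2}(I_{Y_1}(x, \vec{x}_{Y_1}), \vec{x}_{Y_2})$.
Similarly, in the DN we get: $P_\Psi(\vec{x}) = P_{Y_2}(P_{Y_1}(x, \vec{x}_{Y_1}), \vec{x}_{Y_2})$.

Set $S = S_{Y_1} \cup S_{Y_2}$. Then
$I_\Phi(\neg x^0, \neg \vec{s}^0) = I_\Phi(\neg x^0, \neg \vec{s}^0_{Y_1}, \neg \vec{s}^0_{Y_2}) = I_{Y_2}(I_{Y_1}(\neg x^0, \neg \vec{s}^0_{Y_1}), \neg \vec{s}^0_{Y_2}) = I_{Y_2}(\neg y_1^0, \neg \vec{s}^0_{Y_2}) = \neg y_2^0$.
Therefore, $\Delta I_\Phi(X, S) \neq 0$.

Assume a set $S'_t \subseteq [S]_t$. Set $S_1 = S'_t \cap [S_{Y_1}]_t$ and
$S_2 = S'_t \cap [S_{Y_2}]_t$. Clearly, $S_1 \cap S_2 = \emptyset$. Then,
$P_\Psi(\neg \vec{s}'^{\,0}_t) = P_\Psi(\neg \vec{s}_1^0, \neg \vec{s}_2^0) = P_{Y_2}(P_{Y_1}(\neg \vec{s}_1^0), \neg \vec{s}_2^0) = P_{Y_2}(y_1^0, \neg \vec{s}_2^0) = y_2^0$.
Therefore, $\Delta P_\Psi(S'_t) = 0$ for any $S'_t$. Therefore, $X$ is a
functional cause of $Y_2$.

$S$ is also minimal, as $S_{Y_1}$ and $S_{Y_2}$ are minimal and disjoint.
Therefore, $X$ is a cause of $Y_2$ with responsibility

$\rho = \frac{1}{|S| + 1} = \frac{1}{|S_{Y_1}| + |S_{Y_2}| + 1} = \left(\frac{1}{\rho_1} + \frac{1}{\rho_2} - 1\right)^{-1}$ □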
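
As a quick sanity check of how the responsibilities combine, the following worked instance uses illustrative contingency sizes that are not taken from the paper. It assumes, as in the proof, that responsibility is $1/(|S|+1)$ for a minimal contingency set and that the two sets are disjoint.

```latex
\[
|S_{Y_1}| = 1,\quad |S_{Y_2}| = 2
\;\Longrightarrow\;
\rho_1 = \tfrac{1}{2},\quad \rho_2 = \tfrac{1}{3},
\]
\[
\rho = \frac{1}{|S_{Y_1}| + |S_{Y_2}| + 1} = \frac{1}{4}
     = \left(\frac{1}{\rho_1} + \frac{1}{\rho_2} - 1\right)^{-1}
     = (2 + 3 - 1)^{-1}.
\]
```

The combined responsibility is never larger than either $\rho_1$ or $\rho_2$, since the contingency along the path can only grow.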



C.4 Complexity of functional cause 



Proof [Theorem 3.8] (Complexity). In this proof, we denote
with $I_\Phi$ the potential function of a formula $\Phi$, and with
$P_\Phi$ the potential function of the DE of $\Phi$. Note that $\Psi$
represents a 3DNF formula, and not a DE of $\Phi$.

We use a reduction, inspired by the proof of [9, Theorem
3.3], from the non-tautology problem of a 3DNF: given a
3DNF propositional formula $\Psi$ over a set of variables
$\mathbf{X} = \{X_1, \ldots, X_n\}$, is there a truth assignment for $\mathbf{X}$ that makes
$\Psi$ false?

We transform an instance of the 3DNF tautology problem
to a problem of determining whether a variable is a
functional cause as follows. We create a dependent variable
for every conjunct $C_j$ in $\Psi$ with $\mathbf{C} = \{C_1, \ldots, C_k\}$. Every
variable $X_i$ connects to every $C_j$ it is part of. Eventually
every $C_j$ has 3 incoming edges. We also create a separate
input node $X_0$ and an output node $Y$ with incoming edges
from $X_0$ and all the $C_j \in \mathbf{C}$, which applies the OR function
to its inputs. The final output node $Y$ has formula
$Y = X_0 \lor C_1 \lor \ldots \lor C_k = X_0 \lor \Psi$ (see Fig. 15).

[Figure: Boolean network with input nodes $X_0, X_1, \ldots, X_n$, conjunct nodes $C_1, \ldots, C_k$, and output node $Y = X_0 \lor C_1 \lor \ldots \lor C_k$]

Figure 15: Reduction from 3DNF tautology to determining
functional causes in a Boolean network.

Assume initial assignment $X_0 = 1$ and any assignment
for $\mathbf{X}$. Also name $\Phi$ the Boolean formula representing node
$Y$. We will show that $\Psi$ is a tautology iff $X_0 = 1$ is not a
cause of $Y = 1$. (1) If $\Psi$ is a tautology, then $Y = 1$ for all
assignments, and therefore $\nexists S$ such that $\Delta I_\Phi(X_0, S) \neq 0$,
as the potential function of the formula is also always 1. (2)
If $\Psi$ is not a tautology, then there exists an assignment $\vec{x}'$
for which $\Psi(\vec{x}') = 0$. We assign $S = \{X_i \mid x'_i \neq x_i^0\}$. Then,
clearly, $\Delta I_\Phi(X_0, S) \neq 0$, as $I_\Phi(\neg x_0^0, \neg \vec{s}^0, \vec{x}^0 \setminus \vec{s}) = 0$. Also,
for any $S'$, $\Delta P_\Phi(S') = 0$, as $\Phi = 1$ for $X_0 = 1$. Therefore, if
$\Psi$ is not a tautology, $X_0 = 1$ is a cause.

By this, we have shown that determining whether a variable
is a functional cause of an event in a Boolean causal
network is NP-hard in the size of the network. □
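
The reduction can be illustrated with a brute-force Python sketch. It builds the output $Y = X_0 \lor \Psi$ and searches for a contingency that makes $X_0 = 1$ counterfactual, which exists exactly when $\Psi$ is not a tautology. The clause encoding and all function names are assumptions made for this illustration, and the search is exponential by design. It checks only the counterfactual condition; the subset condition holds trivially here because $Y = 1$ whenever $X_0 = 1$.

```python
from itertools import product


def eval_dnf(clauses, assignment):
    """Evaluate a DNF given as a list of clauses; each clause is a list of
    (variable, required_value) literals."""
    return int(any(all(assignment[v] == val for v, val in clause)
                   for clause in clauses))


def output_y(x0, clauses, assignment):
    """Output node of the reduction: Y = X0 OR Psi."""
    return x0 | eval_dnf(clauses, assignment)


def contingency_for_x0(clauses, x_init):
    """Search for a set S of variables whose inversion makes X0 = 1
    counterfactual for Y = 1; return None if no such set exists."""
    variables = sorted(x_init)
    for values in product((0, 1), repeat=len(variables)):
        candidate = dict(zip(variables, values))
        # X0 is counterfactual exactly when Psi is false under `candidate`:
        # then Y = 1 with X0 = 1 and Y = 0 with X0 = 0.
        if output_y(0, clauses, candidate) == 0:
            return {v for v in variables if candidate[v] != x_init[v]}
    return None


# Psi = x1 OR (NOT x1) is a tautology: no contingency exists.
tautology = [[("x1", 1)], [("x1", 0)]]
# Psi = x1 AND x2 is not a tautology: some contingency exists.
non_tautology = [[("x1", 1), ("x2", 1)]]
```

For the tautology the search returns `None`, so $X_0 = 1$ is not a cause; for the non-tautology it returns a set of variables to invert.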



Proof [Lemma 3.9] (Causality in Trees). If the causal
network is a tree, then every node is Markovian. That means
that there is a single path from any variable $X_i$ to the effect
variable $Y$, consisting of variables $p = \{Y_1, \ldots, Y\}$.

In a tree causality is transitive (Prop. 3.7): if $X_i$ is a cause
of $Y_j \in p$, and $Y_j$ is a cause of $Y$, then $X_i$ is a cause of $Y$.

We will show that if $X_i$ is not a cause of $Y_j \in p$, then $X_i$
cannot be a cause of $Y$.

Assume that $X_i$ is a cause of $Y$. Then $\exists S$ such that
$\Delta P_Y(X_i, S) \neq 0$ and $\forall S' \subseteq S$, $\Delta P_Y(S') = 0$. Set
$S_Y = S \cap A(Y_j)$, where $A(Y_j)$ is the subset of $\mathbf{X}$ that are ancestors
of $Y_j$. Also set $S_R = S \setminus S_Y$.

If $S$ is inverted, $X_i$ is counterfactual for $Y$:

$P_Y(x_i^0, \neg \vec{s}^0, \vec{x}^0 \setminus \{X_i, \vec{s}\}) \neq P_Y(\neg x_i^0, \neg \vec{s}^0, \vec{x}^0 \setminus \{X_i, \vec{s}\})$

Since there is only one path from $X_i$ to $Y$ through $Y_j$,
$Y_j$ has to also flip values when $S \leftarrow \neg \vec{s}^0$ and $X_i \leftarrow \neg x_i^0$,
meaning $\Delta P_{Y_j}(X_i, S_Y) = \Delta P_{Y_j}(X_i, S) \neq 0$. Also, $Y_j$ should
have its original value when $X_i$ is set to its original value
with $S$ inverted, otherwise $X_i$ would not be counterfactual
for $Y$. So, $\Delta P_{Y_j}(S_Y) = 0$.

Assume $S'_Y \subseteq S_Y$. $\Delta P_Y(S'_Y \cup S_R) = 0$, which means that
$\Delta P_{Y_j}(S'_Y) = 0$.

Therefore, $X_i$ also has to be a cause of $Y_j$. □



Proof [Theorem 3.10] (Restricted Arity). Follows directly
from Lemma 3.9. Since the tree has restricted arity
$< k$, determining causality of a node for its immediate
descendant is polynomial. Also, because of transitivity, as
shown in Lemma 3.9, to show that $X$ is a cause of $Y$ it suffices
to show that every node in the path $p = \{Y_1, \ldots, Y\}$ is
a cause of its immediate descendant. The length of the path
grows with $\log n$, and therefore determining whether $X$ is a
cause of $Y$ is in P. □

Proof [Theorem 3.11] (Primitive Operators). Assume
$Y_j$ is the immediate descendant of $Y_i$ in a tree causal network.
If the function of $Y_j$ is a primitive Boolean operator,
i.e. AND, OR, NOT, it can be decided in polynomial time if $Y_i$
is a cause of $Y_j$.

Case A: $Y_j$ is an AND node

Set $S_Y = parents(Y_j) \setminus \{Y_i\}$. If $y_j^0 = 1$, then $Y_i$ is a cause
because it is counterfactual. If $y_j^0 = 0$ and $y_i^0 = 1$, then $Y_i$ is
not a cause because the AND function is monotone, so setting
$Y_i$ to 0 will not invert $Y_j$ under any contingency. If $y_j^0 = 0$
and $y_i^0 = 0$, $\exists S \subseteq \mathbf{X}$ that sets $S_Y$ to true. This is always
possible because in a tree every input node participates in
the formula exactly once. Therefore, $Y_i$ becomes counterfactual
for $S$, and $Y_j$ is always 0 when $Y_i$ is set to $y_i^0 = 0$,
which means that $\Delta P_\Phi(S'_t) = 0$ for any $S'_t$. This means that
when $y_j^0 = 0$ and $y_i^0 = 0$, $Y_i$ is a cause of $Y_j$. Therefore, it
is determined in constant time whether a node is a cause of
its immediate descendant AND node.

Case B: $Y_j$ is an OR node

Set $S_Y = parents(Y_j) \setminus \{Y_i\}$. If $y_j^0 = 0$, then $Y_i$ is a cause
because it is counterfactual. If $y_j^0 = 1$ and $y_i^0 = 0$, then $Y_i$ is
not a cause because the OR function is monotone, so setting
$Y_i$ to 1 will not invert $Y_j$ under any contingency. If $y_j^0 = 1$
and $y_i^0 = 1$, $\exists S \subseteq \mathbf{X}$ that sets $S_Y$ to false. This is always
possible because in a tree every input node participates in
the formula exactly once. Therefore, $Y_i$ becomes counterfactual
for $S$, and $Y_j$ is always 1 when $Y_i$ is set to $y_i^0 = 1$,
which means that $\Delta P_\Phi(S'_t) = 0$ for any $S'_t$. This means that
when $y_j^0 = 1$ and $y_i^0 = 1$, $Y_i$ is a cause of $Y_j$. Therefore, it
is determined in constant time whether a node is a cause of
its immediate descendant OR node.

Case C: $Y_j$ is a NOT node

Causality can be determined in constant time because
there is a single input to the node.

Therefore it is decidable in constant time whether $Y_i$ is
a cause of its immediate descendant $Y_j$. That means that
to show that $X$ is a cause of $Y$ it suffices to show that
every node in the path $p = \{Y_1, \ldots, Y\}$ is a cause of its
immediate descendant, each of which steps can be done in
constant time. The length of the path grows with $\log n$, and
therefore determining whether $X$ is a cause of $Y$ is in P. □
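
The case analysis collapses to a simple local test: for an AND or OR gate, the parent is a cause exactly when its actual value equals the gate's actual value, and for a NOT gate the single input is always a cause. The Python sketch below walks the unique path from the output to an input and applies this test at every gate. The tuple encoding of the tree and the function names are assumptions for illustration, and the sketch re-evaluates subtrees for simplicity rather than caching values.

```python
def evaluate(node, x0):
    """Evaluate a tree node: a variable name or a tuple (op, child, ...)."""
    if isinstance(node, str):
        return x0[node]
    op, *children = node
    values = [evaluate(c, x0) for c in children]
    if op == "AND":
        return int(all(values))
    if op == "OR":
        return int(any(values))
    if op == "NOT":
        return 1 - values[0]
    raise ValueError(f"unknown operator: {op}")


def contains(node, var):
    """True if the input variable `var` occurs in the subtree `node`."""
    if isinstance(node, str):
        return node == var
    return any(contains(c, var) for c in node[1:])


def is_cause_in_tree(root, var, x0):
    """Check every step on the path from `root` down to `var`.

    Assumes a tree: every input variable occurs exactly once.
    """
    if not contains(root, var):
        return False
    node = root
    while not isinstance(node, str):
        op, *children = node
        child = next(c for c in children if contains(c, var))
        # AND/OR: the child is a cause of the gate iff both have the same value.
        if op != "NOT" and evaluate(child, x0) != evaluate(node, x0):
            return False
        node = child
    return True


# Loader example (Fig. 14): Y = (A AND B) OR C with A = 1, B = 0, C = 1.
loader = ("OR", ("AND", "A", "B"), "C")
loader_x0 = {"A": 1, "B": 0, "C": 1}
```

On the loader example this reports C as a cause and rejects both A and B, matching the FC analysis of Example B.3.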



Proof [Theorem 3.13] (Positive DNF). Assume $\Phi$ a formula
in DNF with no negated literals. Also, as shown by
Theorem 3.4, the network structure does not alter causality,
so we will assume a star network.

Case A: $\Phi = 1$

There is a polynomial transformation of $\Phi$ to a minimal
form $\Psi$, such that $\Psi = \Phi$ and $\Psi$ only contains the minterm
clauses of $\Phi$. For example, if $\Phi = (A \land B) \lor (A \land B \land C) \lor (C \land D)$,
then $\Psi = (A \land B) \lor (C \land D)$. The transformation is
polynomial, as $\Psi$ includes a clause $C_i$ of $\Phi$ only if there is no
other clause $C_j$ whose literals form a subset of the literals of $C_i$.
We will show that a variable $X_i$ is a functional cause of $\Phi$ iff
$X_i \in C_i$, where $C_i$ is a clause of $\Psi$ that evaluates to true under
the current assignment, and $x_i^0 = 1$.

First of all, if $X_i$ is not in $\Psi$, then $X_i$ is not a cause of $\Phi$,
as there is no assignment that makes $X_i$ counterfactual for
$\Psi$ and therefore for $\Phi$, as $\Psi = \Phi$. Therefore, $X_i$ cannot be
a cause of $\Phi$.

If $X_i$ is in $\Psi$, but $\forall C_i \in \Psi$ that contain $X_i$, $C_i$ evaluates
to false, then $X_i$ cannot be a functional cause, because of
monotonicity since there is no negation. Any set $S$ that
makes $X_i$ counterfactual has to contain the variables of
$C_i$ whose initial assignment was 0; denote them with set
$S_c \subseteq S$, and $S' = S \setminus S_c$. Then $\Psi(\neg x_i^0, \neg \vec{s}^0) = 0$ and
$\Psi(x_i^0, \neg \vec{s}^0) = 1$. $\Psi$ can be written as $\Psi = C_i \lor \Psi'$.
This means that $\Psi'(\neg \vec{s}'^{\,0}) = 0$. Therefore, $\Psi(\neg \vec{s}'^{\,0}) = C_i(\vec{x}^0) \lor \Psi'(\neg \vec{s}'^{\,0}) = 0$,
which means that $\Delta P_\Psi(S') \neq 0$, so $X_i$ is not a cause.

Also, if $x_i^0 = 0$, because of monotonicity, it is not possible
to switch a formula from 1 to 0 by flipping $X_i$ from 0 to 1.

Now, if $X_i \in C_i$, where $C_i$ is a clause of $\Psi$ that evaluates
to true under the current assignment, and $x_i^0 = 1$, we select
$S = \{X_j \mid X_j \notin C_i \text{ and } x_j^0 = 1\}$. Then, if we write $\Psi = C_i \lor \Psi'$,
we know that $\Psi'(\neg \vec{s}^0) = 0$, because $\Psi$ contains only
minterms. That means that every clause $C_j$ has at least one
variable that is not in $C_i$, and therefore can be negated by
the above choice of $S$. This makes $X_i$ counterfactual with
contingency $S$. Also $\Phi = \Psi$, therefore $\Delta P_\Phi = \Delta P_\Psi$. Since
$X_i$ is counterfactual, $\Delta P_\Psi(X_i, S) \neq 0$, and $\Delta P_\Psi(S) = 0$.
Also, $S$ does not contain any variables of $C_i$, and therefore,
for any subset of $S$, $C_i$ is true, and therefore $\Psi$ is also true.

Therefore, $X_i$ is a cause of $\Phi$ iff $X_i \in C_i$, where $C_i$ is a
clause of $\Psi$ that evaluates to true under the current assignment,
and $x_i^0 = 1$, and this can be determined in polynomial time.

Case B: $\Phi = 0$

We again define the minterms transformation of $\Phi$ to $\Psi$,
but this time by first eliminating the variables whose initial
assignment is 1. For example, if $\Phi = (A \land B) \lor (C \land D) \lor (A \land E)$,
with initial assignment $(a, b, c, d, e)^0 = (0, 1, 0, 1, 0)$,
then $\Psi = A \lor C$. The clause $(A \land E)$ gets eliminated because
of the presence of minterm $A$. Similarly, we will show that
$X_i$ is a cause of $\Phi$ iff $X_i \in C_i$, where $C_i$ is a clause of $\Psi$, and
$x_i^0 = 0$.

First of all, if $X_i$ is not in $\Psi$, then $X_i$ is not a cause of
$\Phi$, because either $x_i^0 = 1$, which eliminates it as a possible
cause because of the monotonicity argument, or there was
a minterm in $\Psi$ that caused its elimination. To make $X_i$
counterfactual in $\Phi$ through clause $C_i$, we need to invert all
the variables $X_j \in C_i$ for which $x_j^0 = 0$ ($S_c = \{X_j \mid X_j \in C_i \text{ and } x_j^0 = 0\}$).
But because there is a minterm $C_j$ in $\Psi$
that contains a subset of these variables, the inversion would
switch $C_j$ to true. $X_i$ will not be counterfactual unless we
also invert at least one variable $X_k \in C_j$ for which $x_k^0 = 1$.
So $S = S_c \cup \{X_k\}$. Then $\Delta P_\Phi(X_i, S) \neq 0$, but for $S' = S_c$,
$\Delta P_\Phi(S') \neq 0$, which means that $X_i$ is not a cause.

If $X_i \in C_i$ and $C_i \in \Psi$, set $S = \{X_j \mid X_j \in C_i \text{ and } x_j^0 = 0\}$.
That makes $X_i$ counterfactual in $\Psi$ (and also in $\Phi$), as we
know that there are no other clauses in $\Psi$ that contain a
subset of $S$ causing them to result to true. Also, obviously,
for any subset of $S$, $\Psi$ as well as $\Phi$ result to 0, which is the
initial assignment.

Therefore, $X_i$ is a functional cause of $\Phi$ iff $X_i \in C_i$, where
$C_i$ is a clause of $\Psi$, and $x_i^0 = 0$, and this can be determined in
polynomial time. □
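
The two criteria of this proof translate into a short polynomial-time test. The Python sketch below represents a positive DNF as a list of clauses (each a list of variable names), keeps only the minimal clauses, and applies the Case A and Case B conditions as stated in the proof. The function names and the encoding are illustrative assumptions; the code implements the stated criteria, not the general FC definition.

```python
def minimal_clauses(clauses):
    """Keep only clauses that have no proper subset among the clauses."""
    unique = {frozenset(c) for c in clauses}
    return {c for c in unique if not any(other < c for other in unique)}


def is_cause_when_true(var, clauses, x0):
    """Case A (formula evaluates to 1): var must be 1 and occur in a
    minimal clause that is true under the actual assignment x0."""
    if x0[var] != 1:
        return False
    return any(var in c and all(x0[v] == 1 for v in c)
               for c in minimal_clauses(clauses))


def is_cause_when_false(var, clauses, x0):
    """Case B (formula evaluates to 0): drop variables that are 1, then
    var must be 0 and occur in a minimal clause of the reduced formula."""
    if x0[var] != 0:
        return False
    reduced = [[v for v in c if x0[v] == 0] for c in clauses]
    return any(var in c for c in minimal_clauses(reduced))


# Case A example from the proof: (A AND B) OR (A AND B AND C) OR (C AND D).
phi_true = [["A", "B"], ["A", "B", "C"], ["C", "D"]]
x_true = {"A": 1, "B": 1, "C": 1, "D": 0}

# Illustrative Case B instance (not from the paper): (A AND B) OR (C AND D).
phi_false = [["A", "B"], ["C", "D"]]
x_false = {"A": 0, "B": 1, "C": 0, "D": 0}
```

Removing absorbed clauses needs only pairwise subset checks, which is what keeps the whole test polynomial in the size of the formula.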



Proof Theorem 3.14 



(Positive CNF). Assume <E> a for- 
mula in CNF with no negated literals. Also, as shown by 
|Theorem 3.4| the network structure does not alter causality, 
so we will assume a star network. 
Case A: $ = 1 

We define the maxterms transformation of $ to 9, also 
eliminating variables with initial assignment 0. Due to mono- 
tonicity, any variable Xi with xi — cannot be a cause. As 
an example, if $ = (A V B V C) A (A V B) A (A V D), and 
initial assignment (a, b, c, d) — (1, 1, 1, 0), then 9 = A. This 
is because the clause (A V B V C) gets eliminated because 



of the presence of maxterm (Ay B), and because D — 0, 
(A V B) also gets eliminated because of the creation of max- 
term A. We will show that Xi is a functional cause of $ = 1 
iff Xi £ 9, which is computable in polynomial time. 

If $X_i \notin \Psi$, then either $x_i = 0$, in which case $X_i$ cannot
be a cause, or $X_i$ was part of an eliminated clause $C_i$. $C_i =
C'_i \vee X_i$ was eliminated because there was another clause
$C_j \in \Phi$ which can be split into $C_j = C_j^+ \vee C_j^-$, so that
$C_j^-$ evaluated to 0 under the given assignment, and $C_j^+ \subseteq C'_i$.
If $X_i$ is a cause in $\Phi$, then $\exists S$ s.t. $\Delta P_\Phi(X_i, S) \neq 0$
and $\forall S' \subset S$, $\Delta P_\Phi(S') = 0$. There has to be $S_c \subseteq S$ such
that $S_c = \{X_j \mid X_j \in C'_i \text{ and } x_j^0 = 1\}$; in other words, we
need to set to 0 all variables in $C'_i$ in order to make $X_i$
counterfactual, and $C'_i$ has to contain at least one variable
set to true, otherwise $X_i$ would not have been eliminated.
Since $C_j^+ \subseteq C'_i$, then $C_j^+ \subseteq S_c$. That means that inverting
$S_c$ would set $C_j$ to false. For $X_i$ to be counterfactual, $S$
also needs to contain at least one variable from $C_j^-$, call it
$X^-$. However, for $S' = S \setminus \{X^-\}$, $C_j$ would be set to false,
so $\Delta P_\Phi(S') \neq 0$, which means that $X_i$ cannot be a cause.

If $X_i \in C_i \in \Psi$, then set $S = \{X_j \mid X_j \in C'_i \text{ and } x_j^0 = 1\}$,
where $C_i = C'_i \vee X_i$. Inverting $S$ does not invert $\Psi$ (or $\Phi$), as
there is no $C_j$ in $\Phi$ or $\Psi$ with $C_j \subseteq C'_i$, otherwise $C_i$ would
have been eliminated since $C_j$ would have been a maxterm.
Therefore $X_i$ is counterfactual with contingency $S$. Also
$\forall S' \subset S$, $\Delta P_\Phi(S') = 0$, as no clause can be negated if we
invert fewer positive terms due to monotonicity. Therefore
$X_i$ is a cause.

Since the transformation from $\Phi$ to $\Psi$ is polynomial,
causality of $X_i$ can be determined in polynomial time.
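
The Case A test can be illustrated with a minimal Python sketch. It is not code from the paper: the encoding of clauses as sets of variable names and the function names `minimal_clauses` and `causes_when_true` are assumptions made only for this example.

```python
def minimal_clauses(clauses):
    """Keep only clauses that have no other clause as a proper subset."""
    unique = {frozenset(c) for c in clauses}
    return {c for c in unique if not any(other < c for other in unique)}


def causes_when_true(clauses, assignment):
    """Case A sketch: variables that survive in Psi when Phi evaluates to 1.

    clauses    -- positive CNF as an iterable of sets of variable names
    assignment -- dict mapping each variable name to 0 or 1
    (Both encodings are assumptions of this example.)
    """
    # Drop variables whose initial assignment is 0 from every clause.
    reduced = [frozenset(v for v in c if assignment[v] == 1) for c in clauses]
    # Phi = 1 means every clause keeps at least one true variable.
    if any(len(c) == 0 for c in reduced):
        raise ValueError("formula does not evaluate to 1 under this assignment")
    psi = minimal_clauses(reduced)
    return set().union(*psi)


# Example from the text: Phi = (A v B v C) ^ (A v B) ^ (A v D)
phi = [{"A", "B", "C"}, {"A", "B"}, {"A", "D"}]
print(causes_when_true(phi, {"A": 1, "B": 1, "C": 1, "D": 0}))  # {'A'}
```

The subset test is quadratic in the number of clauses, which is in line with the polynomial bound claimed for the transformation.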
Case B: $\Phi = 0$

From $\Phi$ we construct $\Psi$, which just contains the maxterms
of $\Phi$. For example, if $\Phi = (A \vee B \vee C) \wedge (A \vee B) \wedge (A \vee D)$,
then $\Psi = (A \vee B) \wedge (A \vee D)$. It is always $\Psi = \Phi$. We will
show that $X_i$ is a functional cause of $\Phi$ iff $X_i \in C_i$, where
$C_i$ is a clause in $\Psi$ that evaluates to false.

Assume that $X_i$ is a cause of $\Phi$ in clause $C_i = C'_i \vee X_i$.
Then $\exists S$ such that $\Delta P_\Psi(X_i, S) \neq 0$ and $\forall S' \subset S$,
$\Delta P_\Psi(S') = 0$. If $C'_i$ evaluates to 1 under the given assignment,
then $S$ has to contain $S_c = \{X_j \mid X_j \in C'_i \text{ and } x_j^0 = 1\}$. Also,
every clause $C_j \neq C_i$ has to be set to true after the inversion
of $S$, otherwise $X_i$ would not be counterfactual. That
means that $S$ should also contain a subset $S_r$ that sets all
other clauses to 1. But then, for $S' = S \setminus S_c$, the formula will
evaluate to 1, because $C'_i$ will evaluate to 1 along with all
other clauses, which would make $X_i$ not a cause. Therefore,
$C'_i$ evaluates to 0 under the given assignment. Still, the fact
that $S$ contains a variable $X_j$ from every clause other than
$C_i$ means that every clause contains a variable $X_j$ that is
not contained in $C'_i$. Therefore, $\nexists C_j$ such that $C_j \subseteq C'_i$,
and therefore $C_i$ is a maxterm. Therefore, if $X_i$ is a functional
cause of $\Phi$, then $X_i \in C_i$, where $C_i$ is a clause in $\Psi$ that
evaluates to false.

If $X_i \in C_i$, where $C_i$ is a clause in $\Psi$ that evaluates to false,
then set $S = \{X_j \mid X_j \notin C_i \text{ and } x_j^0 = 0\}$. That will set all
clauses apart from $C_i$ to true. That is guaranteed because $C_i$
is a maxterm, and therefore $\nexists C_j \subseteq C'_i$, where $C_i = C'_i \vee X_i$.
Then $X_i$ is counterfactual for $\Psi$ and $\Phi$, and also, $\forall S' \subset S$,
$\Delta P_\Phi(S') = 0$, as $C_i$ remains false. Therefore, $X_i$ is a
functional cause.

Since the maxterm transformation is polynomial,
causality of $X_i$ can be determined in polynomial time. □

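The Case B characterization can be sketched in the same spirit. The snippet below is an illustration under stated assumptions rather than the authors' implementation: clauses are encoded as sets of variable names, and `causes_when_false` is a hypothetical helper name.

```python
def causes_when_false(clauses, assignment):
    """Case B sketch: variables of the false maxterm clauses when Phi = 0.

    clauses    -- positive CNF as an iterable of sets of variable names
    assignment -- dict mapping each variable name to 0 or 1
    (Both encodings are assumptions of this example.)
    """
    unique = {frozenset(c) for c in clauses}
    # Psi keeps the clauses that have no other clause as a proper subset.
    psi = {c for c in unique if not any(other < c for other in unique)}
    # A positive clause is false exactly when all of its variables are 0.
    false_clauses = [c for c in psi if all(assignment[v] == 0 for v in c)]
    if not false_clauses:
        raise ValueError("formula does not evaluate to 0 under this assignment")
    return set().union(*false_clauses)


# Example from the text: Psi = (A v B) ^ (A v D)
phi = [{"A", "B", "C"}, {"A", "B"}, {"A", "D"}]
print(causes_when_false(phi, {"A": 0, "B": 0, "C": 1, "D": 1}))
```

With this assignment only the clause (A v B) of Psi is false, so the sketch reports A and B.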
D. DETAILS SECTION 4 
D.1 Complexity of aggregates



Proof [Lemma 4.4] (Sum possible causes). Assume $\omega^0$ is
the result of a summation query, $X^+$ the set of all the
true tuples, and $X^-$ the false ones.

Case A: Q = Why SO?, op = ">": $\phi = (\omega^0 > c) = 1$.

Assume $X_i \in X^-$. If $X_i$ is a cause of $\phi$, there must exist a
support $S \subseteq X \setminus \{X_i\}$ such that $\Delta P(X_i, S) \neq 0$. Also, $S$ is
partitioned into $S^+ = S \cap X^+$ and $S^- = S \cap X^-$. Assume
$\omega_S$, $\omega_S^+$, $\omega_S^-$ are the sums of the values in $S$, $S^+$ and $S^-$ respectively,
and $\omega_t$ is the value of $t$. Then $\omega^0 - \omega_S^+ + \omega_S^- + \omega_t < c \Rightarrow \omega_S^+ >
\omega^0 - c + \omega_S^- + \omega_t > 0$. So, $S^+ \neq \emptyset$.

Also, $\omega^0 - \omega_S^+ + \omega_S^- + \omega_t < c \Rightarrow \omega^0 - \omega_S^+ < c - \omega_t - \omega_S^- < c$
because all values are positive. Therefore, for $S'_t = S^+ \subseteq S$,
$\Delta P(S'_t) \neq 0$, so a false tuple cannot be a cause in case A.
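
The arithmetic step of Case A can be checked by brute force on small inputs. The following Python sketch only verifies the implication between the two inequalities above (if inverting $S$ and a false tuple pushes the sum below $c$, then inverting $S^+$ alone already does so); it does not implement the full functional-cause definition. The function names and the list-based encoding of tuple values are assumptions of this example.

```python
from itertools import combinations


def all_subsets(items):
    """Yield every subset of a list as a tuple, including the empty one."""
    for r in range(len(items) + 1):
        yield from combinations(items, r)


def case_a_holds(true_vals, false_vals, c):
    """Check the Case A implication for positive values with sum(true_vals) > c."""
    w0 = sum(true_vals)
    assert w0 > c, "Case A assumes the condition SUM > c currently holds"
    for i, wt in enumerate(false_vals):
        others = false_vals[:i] + false_vals[i + 1:]
        for s_plus in all_subsets(true_vals):
            for s_minus in all_subsets(others):
                # Sum after inverting S = S+ and S-, and the false tuple t.
                flipped = w0 - sum(s_plus) + sum(s_minus) + wt < c
                # Claim: S+ is then non-empty and flips the condition alone.
                if flipped and not (s_plus and w0 - sum(s_plus) < c):
                    return False
    return True


print(case_a_holds([4, 6, 3], [2, 5], 10))  # True
```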

Case B: Q = Why SO?, op = "<": $\phi = (\omega^0 < c) = 1$.

Assume $X_i \in X^+$. If $X_i$ is a cause of $\phi$, there must exist a
support $S \subseteq X \setminus \{X_i\}$ such that $\Delta P(X_i, S) \neq 0$. Also, $S$ is
partitioned into $S^+ = S \cap X^+$ and $S^- = S \cap X^-$. Assume
$\omega_S$, $\omega_S^+$, $\omega_S^-$ are the sums of the values in $S$, $S^+$ and $S^-$ respectively,
and $\omega_t$ is the value of $t$. Then $\omega^0 - \omega_S^+ + \omega_S^- - \omega_t > c \Rightarrow \omega_S^- >
c - \omega^0 + \omega_S^+ + \omega_t > 0$. So, $S^- \neq \emptyset$.

Also, $\omega^0 - \omega_S^+ + \omega_S^- - \omega_t > c \Rightarrow \omega^0 + \omega_S^- > c + \omega_t + \omega_S^+ > c$
because all values are positive. Therefore, for $S'_t = S^- \subseteq S$,
$\Delta P(S'_t) \neq 0$, so a true tuple cannot be a cause in case B.

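Case C: Q = Why NO?, op = "<": $\phi = (\omega^0 < c) = 0$.

Assume $X_i \in X^-$. If $X_i$ is a cause of $\phi$, there must exist a
support $S \subseteq X \setminus \{X_i\}$ such that $\Delta P(X_i, S) \neq 0$. Also, $S$ is
partitioned into $S^+ = S \cap X^+$ and $S^- = S \cap X^-$. Assume
$\omega_S$, $\omega_S^+$, $\omega_S^-$ are the sums of the values in $S$, $S^+$ and $S^-$ respectively,
and $\omega_t$ is the value of $t$. Then $\omega^0 - \omega_S^+ + \omega_S^- + \omega_t < c \Rightarrow \omega_S^+ >
\omega^0 - c + \omega_S^- + \omega_t > 0$. So, $S^+ \neq \emptyset$.

Also, $\omega^0 - \omega_S^+ + \omega_S^- + \omega_t < c \Rightarrow \omega^0 - \omega_S^+ < c - \omega_t - \omega_S^- < c$
because all values are positive. Therefore, for $S'_t = S^+ \subseteq S$,
$\Delta P(S'_t) \neq 0$, so a false tuple cannot be a cause in case C.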

Case D: Q = Why NO?, op = ">": $\phi = (\omega^0 > c) = 0$.

Assume $X_i \in X^+$. If $X_i$ is a cause of $\phi$, there must exist a
support $S \subseteq X \setminus \{X_i\}$ such that $\Delta P(X_i, S) \neq 0$. Also, $S$ is
partitioned into $S^+ = S \cap X^+$ and $S^- = S \cap X^-$. Assume
$\omega_S$, $\omega_S^+$, $\omega_S^-$ are the sums of the values in $S$, $S^+$ and $S^-$ respectively,
and $\omega_t$ is the value of $t$. Then $\omega^0 - \omega_S^+ + \omega_S^- - \omega_t > c \Rightarrow \omega_S^- >
c - \omega^0 + \omega_S^+ + \omega_t > 0$. So, $S^- \neq \emptyset$.

Also, $\omega^0 - \omega_S^+ + \omega_S^- - \omega_t > c \Rightarrow \omega^0 + \omega_S^- > c + \omega_t + \omega_S^+ > c$
because all values are positive. Therefore, for $S'_t = S^- \subseteq S$,
$\Delta P(S'_t) \neq 0$, so a true tuple cannot be a cause in case D. □



Proof [Prop. 4.5] (Why so? = Why no?). The premise
of this statement is straightforward. Let $\omega^0$ be the SUM value,
and let $t$ be a cause of $\phi = (\omega^0 > c) = \text{true}$. Define the condition
$\psi = (\omega^0 \leq c) = (\omega^0 \not> c)$. Clearly $\phi \Rightarrow \neg\psi$, so if $t$ is a cause
of $\phi$ it is also a cause of $\neg\psi$.

Similarly, if $t$ is a cause of $\psi = (\omega^0 \leq c) = \text{true}$, then $\psi \Rightarrow \neg\phi$,
and therefore $t$ is also a cause of $\mathrm{SUM} \not> c$. □



Initialize all $K(0, j) = 0$ and all $K(d, 0) = \infty$
for $j = 1$ to $n$
    for $d = 1$ to $\omega^* - c + v$
        if $d < v_j$: $K(d, j) = K(d, j - 1)$
        else: $K(d, j) = \min[K(d - v_j, j - 1) + 1, K(d, j - 1)]$
return $\min[K(\omega^* - c + 1, n), \ldots, K(\omega^* - c + v, n)]$

Figure 16: Pseudo-polynomial time algorithm to determine
causes for sum over one input relation in
$O(n(\omega^* - c + v))$.



$v_i \in \mathbb{N}^+$ be a given vector of positive integers, $c$ a positive
integer, and define $\Omega(X)$ as the dot product $\Omega(X) = X \cdot V$,
where $X = [x_1, \ldots, x_n]$ with $x_i \in \{0, 1\}$, $1 \leq i \leq n$, represents
a vector of binary variables. The subset sum problem is to
find an assignment of binary values $x^0$ so that $\Omega(x^0) = c$.

We reduce the above SSP problem to the following Why
SO? problem. Construct an ordered set of tuples $T'$ with
one attribute corresponding to the values of the vector $V'$
with $v'_i = v_i$, $i \in \{1, \ldots, n\}$ and $v'_{n+1} = 1$. Now consider the
aggregate $\mathrm{SUM}(x'^0)$ for the actual assignment $x'^0$ with $x'^0_i = 1$ for
$i \in \{1, \ldots, n + 1\}$. Then $t'_{n+1}$ is a Why SO? explanation for
the aggregate condition $(\mathrm{SUM}(x'^0) \geq c + 1) = \text{true}$ iff there
is an assignment $x'^1$ with $x'^1_{n+1} = 0$ for which $\mathrm{SUM}(x'^1) = c$.

Hence, we have reduced SSP to determining causality of
tuples for the SUM aggregate. □
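
The equivalence behind this reduction can be checked exhaustively on small instances. The Python sketch below is illustrative only and rests on two assumptions that are not taken from the paper: the aggregate condition is read as SUM >= c + 1 over integers, and causality of the added unit tuple is tested with the plain counterfactual-under-contingency reading (some subset of the other tuples is kept so that the unit tuple alone decides the condition), not with the full functional-cause definition. Both function names are hypothetical.

```python
from itertools import combinations


def has_subset_sum(values, c):
    """Brute-force subset sum: does some subset of values add up to exactly c?"""
    return any(
        sum(subset) == c
        for r in range(len(values) + 1)
        for subset in combinations(values, r)
    )


def unit_tuple_is_counterfactual(values, c):
    """Is the extra tuple of value 1 decisive for SUM >= c + 1 under some contingency?

    A contingency keeps a subset of the original tuples; the unit tuple is
    decisive when the condition holds with it and fails without it.
    """
    threshold = c + 1
    for r in range(len(values) + 1):
        for kept in combinations(values, r):
            s = sum(kept)
            if s + 1 >= threshold and s < threshold:
                return True
    return False


V = [3, 5, 2, 7]
for target in range(1, sum(V) + 1):
    assert has_subset_sum(V, target) == unit_tuple_is_counterfactual(V, target)
```

Both functions enumerate all subsets, so the sketch is exponential and only meant to make the correspondence concrete.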



Proof [Theorem 4.7] (Sum pseudo-PTIME). Determining
responsibility of a tuple $t$ with value $v$ for a Why
SO? aggregate condition $(\mathrm{SUM}(x^0) > c) = \text{true}$ is solvable in
pseudo-polynomial time $O(nc)$ using the following dynamic
programming algorithm.

Consider the new set $V^*$ that consists of all values of $V$
that are true under the current assignment except for the value
$v$. Let $\omega^* = \mathrm{SUM}(V^*)$. Then we have to find a minimal
subset $V'^* \subseteq V^*$ whose values add up to a value in the closed
interval $[\omega^* - c + 1, \omega^* - c + v]$. Now define the subproblem
$K(d, j)$ as the minimum subset size $|V''^*|$ with values summing up
to $d$ for the subset of values $\{v_1, \ldots, v_j\}$. We then express
$K(d, j)$ in a way that either value $v_j$ is needed to achieve
the minimal value, or it isn't needed:

$K(d, j) = \min[K(d - v_j, j - 1) + 1, K(d, j - 1)]$

The answer we seek is the minimal value of $\{K(d, n) \mid \omega^* -
c + 1 \leq d \leq \omega^* - c + v\}$. The algorithm then consists of filling
out a two-dimensional table, with $n$ rows and $\omega^* - c + v + 1$
columns, hence in $O(n(\omega^* - c + v))$ time (Fig. 16). □
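
The recurrence for $K(d, j)$ can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the function name `min_subset_size` is hypothetical, the interval bounds are passed in explicitly as `lo` and `hi` (they would be $\omega^* - c + 1$ and $\omega^* - c + v$ in the proof above), and the two-dimensional table is collapsed into a single row, which is a standard space optimization and an implementation choice of this example.

```python
import math


def min_subset_size(values, lo, hi):
    """Smallest number of entries of `values` whose sum lies in [lo, hi].

    values -- positive integers (the values in V*)
    Returns None when no subset has a sum in the interval.
    """
    if hi < 0 or lo > hi:
        return None
    lo = max(lo, 0)
    # K[d] = minimum subset size summing to exactly d using the values seen so far.
    # K(0, j) = 0 and K(d, 0) = infinity, as in the initialization of Figure 16.
    K = [0] + [math.inf] * hi
    for v in values:
        # Iterate d downwards so that each value is used at most once.
        for d in range(hi, v - 1, -1):
            K[d] = min(K[d], K[d - v] + 1)
    best = min(K[lo:hi + 1])
    return None if best == math.inf else best


print(min_subset_size([3, 5, 2, 7], 8, 9))  # 2
```

The loop visits each of the `len(values)` entries once per column up to `hi`, matching the table-filling cost stated in the proof.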



Proof Theorem 4.6 (Sum hardness). We will use a
reduction from the subset sum problem (SSP): given $n$ positive
numbers and a target bound $c$, find a subset of the numbers
summing to $c$. More formally: Let $V = [v_1, \ldots, v_n]$ with